Cannot create parquet with snappy output for hive external table

2017-05-16 Thread Dhimant
Hi Group,

I am not able to load data into an external Hive table which is partitioned.

Trace :-

1. create external table test(id int, name string)
   stored as parquet
   location 'hdfs://testcluster/user/abc/test'
   tblproperties ('PARQUET.COMPRESS'='SNAPPY');

2. Spark code:

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
spark.sql("use default").show
val rdd = sc.parallelize(Seq((1, "one"), (2, "two")))
val df = spark.createDataFrame(rdd).toDF("id", "name")
df.write.mode(SaveMode.Overwrite).insertInto("test")

3. I can see a few .snappy.parquet files.

4. create external table test(id int) partitioned by (name string)
   stored as parquet
   location 'hdfs://testcluster/user/abc/test'
   tblproperties ('PARQUET.COMPRESS'='SNAPPY');

5. Spark code:

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
spark.sql("use default").show
val rdd = sc.parallelize(Seq((1, "one"), (2, "two")))
val df = spark.createDataFrame(rdd).toDF("id", "name")
df.write.mode(SaveMode.Overwrite).insertInto("test")

6. I see uncompressed files without the .snappy.parquet extension.
parquet-tools.jar also confirms that these are uncompressed Parquet files.

7. I tried the following option as well, but no luck:

df.write.mode(SaveMode.Overwrite).format("parquet")
  .option("compression", "snappy").insertInto("test")


Thanks in advance.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-create-parquet-with-snappy-output-for-hive-external-table-tp28687.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: problems with checkpoint and spark sql

2016-09-21 Thread Dhimant
Hi David,

Did you get any solution for this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/problems-with-checkpoint-and-spark-sql-tp26080p27773.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Bad Digest error while doing aws s3 put

2016-02-06 Thread Dhimant
Hi, I am getting the following error while reading huge data from S3 and,
after processing, writing the data back to S3.

Did you find any solution for this?

16/02/07 07:41:59 WARN scheduler.TaskSetManager: Lost task 144.2 in stage
3.0 (TID 169, ip-172-31-7-26.us-west-2.compute.internal):
java.io.IOException: exception in uploadSinglePart
at
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:248)
at
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:469)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
at
org.apache.hadoop.io.compress.CompressorStream.close(CompressorStream.java:106)
at java.io.FilterOutputStream.close(FilterOutputStream.java:160)
at
org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
at
org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1080)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: exception in putObject
at
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:149)
at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy26.storeFile(Unknown Source)
at
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:245)
... 15 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The
Content-MD5 you specified did not match what we received. (Service: Amazon
S3; Status Code: 400; Error Code: BadDigest; Request ID: 5918216A5901FCC8),
S3 Extended Request ID:
QSxtYln/yXqHYpdr4BWosin/TAFsGlK1FlKfE5PcuJkNrgoblGzTNt74kEhuNcrJCRZ3mXq0oUo=
at
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3796)
at
com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1482)
at
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:140)
... 22 more





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Bad-Digest-error-while-doing-aws-s3-put-tp10036p26167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Explanation streaming-cep-engine with example

2015-03-25 Thread Dhimant
Hi,
Can someone explain how the Spark streaming CEP engine works?
How can I use it with a sample example?

http://spark-packages.org/package/Stratio/streaming-cep-engine



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Explanation-streaming-cep-engine-with-example-tp22218.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error while Insert data into hive table via spark

2015-03-18 Thread Dhimant
Hi,

I have configured Apache Spark 1.3.0 with Hive 1.0.0 and Hadoop 2.6.0.
I am able to create tables and retrieve data from Hive tables via the
following commands, but I am not able to insert data into a table.

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS newtable (key INT)");
scala> sqlContext.sql("select * from newtable").collect;
15/03/19 02:10:20 INFO parse.ParseDriver: Parsing command: select * from
newtable
15/03/19 02:10:20 INFO parse.ParseDriver: Parse Completed

15/03/19 02:10:35 INFO scheduler.DAGScheduler: Job 0 finished: collect at
SparkPlan.scala:83, took 13.826402 s
res2: Array[org.apache.spark.sql.Row] = Array([1])


But I am not able to insert data into this table via the Spark shell. This
command runs perfectly fine from the Hive shell.

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@294fa094
// scala> sqlContext.sql("INSERT INTO TABLE newtable SELECT 1");
scala> sqlContext.sql("INSERT INTO TABLE newtable values(1)");
15/03/19 02:03:14 INFO metastore.HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/03/19 02:03:14 INFO metastore.ObjectStore: ObjectStore, initialize called
15/03/19 02:03:14 INFO DataNucleus.Persistence: Property
datanucleus.cache.level2 unknown - will be ignored
15/03/19 02:03:14 INFO DataNucleus.Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/03/19 02:03:14 WARN DataNucleus.Connection: BoneCP specified but not
present in CLASSPATH (or one of dependencies)
15/03/19 02:03:15 WARN DataNucleus.Connection: BoneCP specified but not
present in CLASSPATH (or one of dependencies)
15/03/19 02:03:16 INFO metastore.ObjectStore: Setting MetaStore object pin
classes with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only"
so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only"
so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Query: Reading in results for query
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is
closing
15/03/19 02:03:18 INFO metastore.ObjectStore: Initialized ObjectStore
15/03/19 02:03:19 INFO metastore.HiveMetaStore: Added admin role in
metastore
15/03/19 02:03:19 INFO metastore.HiveMetaStore: Added public role in
metastore
15/03/19 02:03:19 INFO metastore.HiveMetaStore: No user is added in admin
role, since config is empty
15/03/19 02:03:20 INFO session.SessionState: No Tez session required at this
point. hive.execution.engine=mr.
15/03/19 02:03:20 INFO parse.ParseDriver: Parsing command: INSERT INTO TABLE
newtable values(1)
NoViableAltException(26@[])
at
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:742)
at
org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:40171)
at
org.apache.hadoop.hive.ql.parse.HiveParser.singleSelectStatement(HiveParser.java:38048)
at
org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:37754)
at
org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:37654)
at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:36898)
at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:36774)
at
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1338)
at
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfu
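
The NoViableAltException above (the trace is truncated in the archive) comes
from the Hive parser bundled with Spark 1.3, which does not accept the
INSERT INTO ... VALUES syntax. A possible workaround, shown only as a sketch
and assuming the sqlContext defined above, is to insert through the DataFrame
API instead; the Record case class below is introduced purely for
illustration.

// Sketch: insert a row via a DataFrame rather than INSERT ... VALUES.
import sqlContext.implicits._

case class Record(key: Int)  // hypothetical helper class for this example
val df = sc.parallelize(Seq(Record(1))).toDF()
df.insertInto("newtable")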

Re: No suitable driver found error, Create table in hive from spark sql

2015-02-18 Thread Dhimant
Found the solution in one of the posts found on the internet.
I updated spark/bin/compute-classpath.sh and added the database connector jar
to the classpath.
CLASSPATH="$CLASSPATH:/data/mysql-connector-java-5.1.14-bin.jar"



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-suitable-driver-found-error-Create-table-in-hive-from-spark-sql-tp21714p21715.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



No suitable driver found error, Create table in hive from spark sql

2015-02-18 Thread Dhimant
No suitable driver found error, Create table in hive from spark sql.

I am trying to execute following example.
SPARKGIT:
spark/examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala

My setup: Hadoop 1.6, Spark 1.2, Hive 1.0, MySQL server (installed via yum
install mysql55w mysql55w-server).

I can create tables in Hive from the Hive command prompt:

hive> select * from person_parquet;
OK
Barack  Obama   M
Bill    Clinton M
Hillary Clinton F
Time taken: 1.945 seconds, Fetched: 3 row(s)

I am starting the Spark shell via the following command:

./spark-1.2.0-bin-hadoop2.4/bin/spark-shell --master
spark://sparkmaster.company.com:7077 --jars
/data/mysql-connector-java-5.1.14-bin.jar

scala> Class.forName("com.mysql.jdbc.Driver")
res0: Class[_] = class com.mysql.jdbc.Driver

scala> Class.forName("com.mysql.jdbc.Driver").newInstance
res1: Any = com.mysql.jdbc.Driver@2dec8e27

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@32ecf100

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
STRING)")
15/02/18 22:23:01 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
NOT EXISTS src (key INT, value STRING)
15/02/18 22:23:02 INFO parse.ParseDriver: Parse Completed
15/02/18 22:23:02 INFO metastore.HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/18 22:23:02 INFO metastore.ObjectStore: ObjectStore, initialize called
15/02/18 22:23:02 INFO DataNucleus.Persistence: Property
datanucleus.cache.level2 unknown - will be ignored
15/02/18 22:23:02 INFO DataNucleus.Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/18 22:23:02 WARN DataNucleus.Connection: BoneCP specified but not
present in CLASSPATH (or one of dependencies)
15/02/18 22:23:02 WARN DataNucleus.Connection: BoneCP specified but not
present in CLASSPATH (or one of dependencies)
15/02/18 22:23:02 ERROR Datastore.Schema: Failed initialising database.
No suitable driver found for jdbc:mysql://sparkmaster.company.com:3306/hive
org.datanucleus.exceptions.NucleusDataStoreException: No suitable driver
found for jdbc:mysql://sparkmaster.company.com:3306/hive
at
org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
at
org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at
org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
at
org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at
org.apache.hadoop.hive.metastore.R

Re: ExecutorLostFailure (executor lost)

2014-10-10 Thread Dhimant
I also received the same error message quite a few times while saving an RDD
to HDFS.
I am using Spark 1.1.0 with Hadoop 2.5 in YARN mode.

If you look at the logs, you might find entries like the following.

14/10/10 14:20:21 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(6, sparkmaster.company.com, 44906, 0) with no
recent heart beats: 71967ms exceeds 45000ms
14/10/10 14:31:15 INFO scheduler.TaskSetManager: Finished task 46.0 in stage
4.0 (TID 544) in 734650 ms on sparknode1.company.com (46/50)
14/10/10 14:55:31 INFO network.ConnectionManager: Removing
ReceivingConnection to ConnectionManagerId(sparkmaster.company.com,44906)
14/10/10 14:55:31 INFO network.ConnectionManager: Removing SendingConnection
to ConnectionManagerId(sparkmaster.company.com,44906)
14/10/10 14:55:31 INFO network.ConnectionManager: Removing SendingConnection
to ConnectionManagerId(sparkmaster.company.com,44906)
14/10/10 14:55:31 INFO cluster.YarnClusterSchedulerBackend: Executor 6
disconnected, so removing it
14/10/10 14:55:31 ERROR cluster.YarnClusterScheduler: Lost executor 6 on
sparkmaster.company.com: remote Akka client disassociated
14/10/10 14:55:31 INFO scheduler.TaskSetManager: Re-queueing tasks for 6
from TaskSet 4.0
14/10/10 14:55:31 WARN scheduler.TaskSetManager: Lost task 45.0 in stage 4.0
(TID 543, sparkmaster.company.com): ExecutorLostFailure (executor lost)
14/10/10 14:55:31 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 10)
14/10/10 14:55:31 INFO storage.BlockManagerMasterActor: Trying to remove
executor 6 from BlockManagerMaster.
14/10/10 14:55:31 INFO storage.BlockManagerMaster: Removed 6 successfully in
removeExecutor
14/10/10 14:55:31 INFO scheduler.TaskSetManager: Starting task 45.1 in stage
4.0 (TID 548, sparknode1.company.com, PROCESS_LOCAL, 948 bytes) 

If you are using YARN, it will reschedule the task and continue processing.

You can try updating the following properties in spark-defaults.conf:
spark.core.connection.ack.wait.timeout  3600
spark.core.connection.auth.wait.timeout 3600
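
Equivalently, if you prefer to set these per application rather than
cluster-wide, the same properties (names taken from the lines above) can be
set on the SparkConf before the context is created; a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the same timeout properties, set programmatically per application.
val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "3600")
  .set("spark.core.connection.auth.wait.timeout", "3600")
val sc = new SparkContext(conf)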



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ExecutorLostFailure-executor-lost-tp14117p16126.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Change number of workers and memory

2014-09-22 Thread Dhimant
I have a Spark cluster with some high-performance nodes, while the others
have commodity specs (lower configuration).
When I configure worker memory and instances in spark-env.sh, it is applied
to all the nodes.
Can I change the SPARK_WORKER_MEMORY and SPARK_WORKER_INSTANCES properties on
a per-node/machine basis?
I am using Spark 1.1.0 version.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Change-number-of-workers-and-memory-tp14866.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: error: type mismatch while Union

2014-09-07 Thread Dhimant
Thank you Aaron for pointing out the problem. This only happens when I run
this code in spark-shell, not when I submit the job.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13677.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: error: type mismatch while Union

2014-09-06 Thread Dhimant
I am using Spark version 1.0.2




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



error: type mismatch while Union

2014-09-05 Thread Dhimant
Hi,
I am getting a type mismatch error during a union operation.
Can someone suggest a solution?

case class MyNumber(no: Int, secondVal: String) extends Serializable
    with Ordered[MyNumber] {
  override def toString(): String = this.no.toString + " " + this.secondVal
  override def compare(that: MyNumber): Int = this.no compare that.no
  override def compareTo(that: MyNumber): Int = this.no compare that.no
  def Equals(that: MyNumber): Boolean = {
    (this.no == that.no) && (that match {
      case MyNumber(n1, n2) => n1 == no && n2 == secondVal
      case _ => false
    })
  }
}
val numbers = sc.parallelize(1 to 20, 10)
val firstRdd = numbers.map(new MyNumber(_, "A"))
val secondRDD = numbers.map(new MyNumber(_, "B"))
val numberRdd = firstRdd.union(secondRDD)

<console>:24: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[MyNumber]
 required: org.apache.spark.rdd.RDD[MyNumber]
       val numberRdd = firstRdd.union(secondRDD)
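
A known cause of this in spark-shell is that re-evaluating the class
definition produces a new type with the same name, so RDDs built against
different evaluations no longer unify. A sketch of a common workaround is to
enter the class and the code that uses it as a single :paste block (MyNumber
is simplified here):

scala> :paste
// Entering paste mode (ctrl-D to finish)

case class MyNumber(no: Int, secondVal: String)  // simplified definition
val numbers = sc.parallelize(1 to 20, 10)
val firstRdd = numbers.map(MyNumber(_, "A"))
val secondRDD = numbers.map(MyNumber(_, "B"))
val numberRdd = firstRdd.union(secondRDD)

// Exiting paste mode, now interpreting.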



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Multiple spark shell sessions

2014-09-04 Thread Dhimant
Thanks Yana,
I am able to execute applications and commands via another session; I also
received another port for the UI application.

Thanks,
Dhimant



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-spark-shell-sessions-tp13441p13459.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Multiple spark shell sessions

2014-09-04 Thread Dhimant
Hi,
I am receiving the following error while connecting to the Spark server via
the shell when one shell is already open.
How can I open multiple sessions?

Does anyone know about a workflow engine/job server like Apache Oozie for
Spark?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
14/09/04 15:07:46 INFO spark.SecurityManager: Changing view acls to: root
14/09/04 15:07:46 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(root)
14/09/04 15:07:46 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/09/04 15:07:47 INFO Remoting: Starting remoting
14/09/04 15:07:47 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sp...@sparkmaster.guavus.com:42236]
14/09/04 15:07:47 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sp...@sparkmaster.guavus.com:42236]
14/09/04 15:07:47 INFO spark.SparkEnv: Registering MapOutputTracker
14/09/04 15:07:47 INFO spark.SparkEnv: Registering BlockManagerMaster
14/09/04 15:07:47 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140904150747-4dcd
14/09/04 15:07:47 INFO storage.MemoryStore: MemoryStore started with
capacity 294.9 MB.
14/09/04 15:07:47 INFO network.ConnectionManager: Bound socket to port 54453
with id = ConnectionManagerId(sparkmaster.guavus.com,54453)
14/09/04 15:07:47 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/09/04 15:07:47 INFO storage.BlockManagerInfo: Registering block manager
sparkmaster.guavus.com:54453 with 294.9 MB RAM
14/09/04 15:07:47 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/04 15:07:47 INFO spark.HttpServer: Starting HTTP Server
14/09/04 15:07:47 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/04 15:07:47 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:48977
14/09/04 15:07:47 INFO broadcast.HttpBroadcast: Broadcast server started at
http://192.168.1.21:48977
14/09/04 15:07:47 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-0e45759a-2c58-439a-8e96-95b0bc1d6136
14/09/04 15:07:47 INFO spark.HttpServer: Starting HTTP Server
14/09/04 15:07:47 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/04 15:07:47 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:39962
14/09/04 15:07:48 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/04 15:07:48 WARN component.AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already
in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at
org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at
org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
at
org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at
org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
at
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
at org.apache.spark.SparkContext.(SparkContext.scala:223)
at
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
at $line3.$read$$iwC$$iwC.(:8)
at $line3.$read$$iwC.(:14)
at $line3.$read.(:16)
at $line3.$read$.(:20)
at $line3.$read$.()
at $line3.$eval$.(:7)
at $line3.$eval$.()
at $line3.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala

error: type mismatch while assigning RDD to RDD val object

2014-09-03 Thread Dhimant
I am receiving the following error in spark-shell while executing the
following code.

class LogRecrod(logLine: String) extends Serializable {
  val splitvals = logLine.split(",")
  val strIp: String = splitvals(0)
  val hostname: String = splitvals(1)
  val server_name: String = splitvals(2)
}

var logRecordRdd: org.apache.spark.rdd.RDD[LogRecrod] = _

val sourceFile =
  sc.textFile("hdfs://192.168.1.30:9000/Data/Log_1406794333258.log", 2)
14/09/04 12:08:28 INFO storage.MemoryStore: ensureFreeSpace(179585) called
with curMem=0, maxMem=309225062
14/09/04 12:08:28 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 175.4 KB, free 294.7 MB)
sourceFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:12


scala> logRecordRdd = sourceFile.map(line => new LogRecrod(line))
<console>:18: error: type mismatch;
 found   : LogRecrod
 required: LogRecrod
       logRecordRdd = sourceFile.map(line => new LogRecrod(line))

Any suggestions to resolve this problem?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-assigning-RDD-to-RDD-val-object-tp13429.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



java.io.NotSerializableException exception - custom Accumulator

2014-04-08 Thread Dhimant
Hi,

I am getting a java.io.NotSerializableException while executing the
following program.

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam
object App {
  class Vector (val data: Array[Double]) {}
  implicit object VectorAP extends AccumulatorParam[Vector]  { 
def zero(v: Vector) : Vector = new Vector(new Array(v.data.size)) 
def addInPlace(v1: Vector, v2: Vector) : Vector = { 
  for (i <- 0 to v1.data.size-1) v1.data(i) += v2.data(i) 
  return v1 
} 
  } 
  def main(sc:SparkContext) {
val vectorAcc = sc.accumulator(new Vector(Array(0, 0)))
val accum = sc.accumulator(0) 
val file = sc.textFile("/user/root/data/SourceFiles/a.txt", 10)
file.foreach(line => {println(line); accum+=1; vectorAcc.add(new
Vector(Array(1,1 ))) ;})
println(accum.value)
println(vectorAcc.value.data)
println("=" )
  }
}

--

scala> App.main(sc)
14/04/09 01:02:05 INFO storage.MemoryStore: ensureFreeSpace(130760) called
with curMem=0, maxMem=308713881
14/04/09 01:02:05 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 127.7 KB, free 294.3 MB)
14/04/09 01:02:07 INFO mapred.FileInputFormat: Total input paths to process
: 1
14/04/09 01:02:07 INFO spark.SparkContext: Starting job: foreach at
<console>:30
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Got job 0 (foreach at
<console>:30) with 11 output partitions (allowLocal=false)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Final stage: Stage 0 (foreach
at <console>:30)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Submitting Stage 0
(MappedRDD[1] at textFile at <console>:29), which has no missing parents
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Failed to run foreach at
<console>:30
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$App$
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
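
The exception names the REPL wrapper of the App object itself
($iwC$...$App$), so one thing worth trying (a sketch, not verified against
this setup) is to make the enclosing object and the accumulated Vector type
serializable, since the implicit AccumulatorParam and the Vector values are
shipped along with the closure:

import org.apache.spark.{AccumulatorParam, SparkContext}

// Sketch: same program, with the enclosing object and Vector marked Serializable.
object App extends Serializable {
  class Vector(val data: Array[Double]) extends Serializable

  implicit object VectorAP extends AccumulatorParam[Vector] {
    def zero(v: Vector): Vector = new Vector(new Array(v.data.size))
    def addInPlace(v1: Vector, v2: Vector): Vector = {
      for (i <- 0 until v1.data.size) v1.data(i) += v2.data(i)
      v1
    }
  }

  def main(sc: SparkContext) {
    val vectorAcc = sc.accumulator(new Vector(Array(0, 0)))
    val accum = sc.accumulator(0)
    val file = sc.textFile("/user/root/data/SourceFiles/a.txt", 10)
    file.foreach(line => { println(line); accum += 1; vectorAcc.add(new Vector(Array(1, 1))) })
    println(accum.value)
    println(vectorAcc.value.data.mkString(","))
  }
}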




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-exception-custom-Accumulator-tp3971.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

