RE: Exiting driver main() method...
No, you don't need to do anything special. Perhaps your application is getting stuck somewhere? If you can share your code, someone may be able to help. Mohammed From: James Carman [mailto:ja...@carmanconsulting.com] Sent: Friday, May 1, 2015 5:53 AM To: user@spark.apache.org Subject: Exiting driver main() method... In all the examples, it seems that the Spark application doesn't really do anything special in order to exit. When I run my application, however, the spark-submit script just hangs there at the end. Is there something special I need to do to get that thing to exit normally?
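One common cause of this symptom is a SparkContext (or StreamingContext) that is never stopped, or a stray non-daemon thread keeping the JVM alive. A minimal sketch of an explicit shutdown at the end of main() -- the names here are illustrative, not taken from James's application -- looks like this:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-app")   // hypothetical app name
    val sc = new SparkContext(conf)
    try {
      // ... the actual work ...
    } finally {
      sc.stop()   // release executors and shut down the driver's services
    }
    // if a non-daemon thread still keeps the JVM alive, sys.exit(0) forces the exit
  }
}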
Re: real time Query engine Spark-SQL on Hbase
Hi, Thanks for the reply. Hbase cli takes less than 500 ms for the same query. I am running a simple query, i.e. Select * from Customers where c_id='123123'. Why would the same query which takes 500 ms at the Hbase cli end up taking around 8 secs via Spark-Sql? I am unable to understand this. Thanks, Siddharth From: ayan guha guha.a...@gmail.com Sent: 01 May 2015 04:38 To: Ted Yu Cc: user@spark.apache.org; Siddharth Ubale; matei.zaha...@gmail.com; Prakash Hosalli; Amit Kumar Subject: Re: real time Query engine Spark-SQL on Hbase And if I may ask, how long does it take in the hbase CLI? I would not expect spark to improve performance of hbase. At best spark will push down the filter to hbase. So I would try to optimise any additional overhead like bringing data into spark. On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote: bq. a single query on one filter criteria Can you tell us more about your filter ? How selective is it ? Which hbase release are you using ? Cheers On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, I want to use Spark as a query engine on HBase with sub-second latency. I am using Spark 1.3 version. And followed the steps below on an HBase table with around 3.5 lakh (350,000) rows: 1. Mapped the DataFrame to the HBase table. RDDCustomers maps to the HBase table which is used to create the DataFrame. DataFrame schemaCustomers = sqlInstance.createDataFrame(SparkContextImpl.getRddCustomers(), Customers.class); 2. Used registerTempTable, i.e. schemaCustomers.registerTempTable("customers"); 3. Running the query on the DataFrame using the SQLContext instance. What I am observing is that for a single query on one filter criterion the query is taking 7-8 seconds, and the time increases as I increase the number of rows in the HBase table. Also, there was one time when I was getting query responses under 1-2 seconds. Seems like strange behavior. Is this expected behavior from Spark or am I missing something here? Can somebody help me understand this scenario. Please assist. Thanks, Siddharth Ubale,
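For context, the three steps Siddharth describes amount to roughly the sketch below (Scala with hypothetical names -- the original code is Java, and the Customer case class here is a stand-in for the real Customers bean). Unless the underlying RDD or data source pushes the c_id predicate down to HBase, Spark has to pull all rows out of HBase and filter them itself, which would be consistent with a multi-second scan where the HBase CLI point-get takes milliseconds:

import org.apache.spark.sql.SQLContext
case class Customer(c_id: String, name: String)          // stand-in for the real Customers bean
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// 1. DataFrame over the RDD backed by the HBase table (rddCustomers: RDD[Customer] is assumed to exist)
val schemaCustomers = rddCustomers.toDF()
// 2. register it as a temporary table
schemaCustomers.registerTempTable("customers")
// 3. run the query; the filter is applied by Spark after the scan unless pushdown is in place
sqlContext.sql("SELECT * FROM customers WHERE c_id = '123123'").show()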
com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:
Hi, I am using Spark 1.2.0 with Kryo serialization, but I get the following exception. java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I would appreciate it if anyone could tell me how I can resolve this. best, /Shahab
not getting any mail
Hi All I am not getting any mail from this community?
Remoting warning when submitting to cluster
Hello all!! We've been prototyping some spark applications to read messages from Kafka topics. The application is quite simple, we use KafkaUtils.createStream to receive a stream of CSV messages from a Kafka Topic. We parse the CSV and count the number of messages we get in each RDD. At a high-level (removing the abstractions of our application), it looks like this: val sc = new SparkConf() .setAppName(appName) .set("spark.executor.memory", "1024m") .set("spark.cores.max", "3") .set("spark.app.name", appName) .set("spark.ui.port", sparkUIPort) val ssc = new StreamingContext(sc, Milliseconds(emitInterval.toInt)) KafkaUtils .createStream(ssc, zookeeperQuorum, consumerGroup, topicMap) .map(_._2) .foreachRDD( (rdd: RDD, time: Time) => { println("Time %s: (%s total records)".format(time, rdd.count())) }) When I submit this with the spark master set to local[3] everything behaves as I'd expect. After some startup overhead, I'm seeing the count printed to be the same as the count I'm simulating (1 every second for example). When I submit this to a spark master using spark://master.host:7077, the behavior is different. The overhead to start receiving seems longer and on some runs I don't see anything for 30 seconds even though my simulator is sending messages to the topic. I also see the following error written to stderr by every executor assigned to the job: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/05/01 10:11:38 INFO SecurityManager: Changing view acls to: username 15/05/01 10:11:38 INFO SecurityManager: Changing modify acls to: username 15/05/01 10:11:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(javi4211); users with modify permissions: Set(username) 15/05/01 10:11:38 INFO Slf4jLogger: Slf4jLogger started 15/05/01 10:11:38 INFO Remoting: Starting remoting 15/05/01 10:11:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@master.host:56534] 15/05/01 10:11:39 INFO Utils: Successfully started service 'driverPropsFetcher' on port 56534. 15/05/01 10:11:40 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@driver.host:51837]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: driver.host/10.27.51.214:51837 15/05/01 10:12:09 ERROR UserGroupInformation: PriviledgedActionException as:username cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:224) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ...
4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) Is there something else I need to configure to ensure akka remoting will work correctly when running on a spark cluster? Or can I ignore this error? -Javier -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Remoting-warning-when-submitting-to-cluster-tp22733.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
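The gated-address warning above suggests the executors cannot open a connection back to the driver at driver.host:51837. One thing worth checking (an assumption on my part, not something confirmed in this thread) is whether the driver is advertising an address and port that the worker machines can actually resolve and reach; in Spark 1.x that is controlled by spark.driver.host and spark.driver.port. A minimal sketch, with placeholder values:

val sc = new SparkConf()
  .setAppName(appName)
  // advertise an address the workers can resolve and connect back to;
  // "driver.reachable.hostname" and 51837 are placeholders, not values from this thread
  .set("spark.driver.host", "driver.reachable.hostname")
  .set("spark.driver.port", "51837")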
Re: Spark - Hive Metastore MySQL driver
Can you try the patch from: [SPARK-6913][SQL] Fixed java.sql.SQLException: No suitable driver found Cheers On Sat, Mar 28, 2015 at 12:41 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: This is from my Hive installation -sh-4.1$ ls /apache/hive/lib | grep derby derby-10.10.1.1.jar derbyclient-10.10.1.1.jar derbynet-10.10.1.1.jar -sh-4.1$ ls /apache/hive/lib | grep datanucleus datanucleus-api-jdo-3.2.6.jar datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar -sh-4.1$ ls /apache/hive/lib | grep mysql mysql-connector-java-5.0.8-bin.jar -sh-4.1$ $ hive --version Hive 0.13.0.2.1.3.6-2 Subversion git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 -r 87da9430050fb9cc429d79d95626d26ea382b96c $ On Sat, Mar 28, 2015 at 1:05 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I tried with a different version of driver but same error ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar, */home/dvasthimal/spark1.3/mysql-connector-java-5.0.8-bin.jar* --files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 On Sat, Mar 28, 2015 at 12:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: This is what am seeing ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar --files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver (com.mysql.jdbc.Driver) was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. 
./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar *--files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 Caused by: java.sql.SQLException: No suitable driver found for jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB at java.sql.DriverManager.getConnection(DriverManager.java:596) Looks like the driver jar that I got in is not correct. On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Could someone please share the spark-submit command that shows the mysql jar containing the driver class used to connect to the Hive MySQL metastore. Even after including it through
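A workaround that is often suggested for the "No suitable driver found" symptom (an assumption here -- it is not confirmed later in this thread) is to put the MySQL connector on the driver's classpath as well, since the metastore connection is opened by the driver-side Hive client and --jars alone may not place the jar where java.sql.DriverManager can see it. A sketch, reusing the paths from the command above:

./bin/spark-submit -v --master yarn-cluster \
  --driver-class-path /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:... \
  --jars ...,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar \
  --files $SPARK_HOME/conf/hive-site.xml \
  --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar ...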
Re: How to add a column to a spark RDD with many columns?
val newRdd = myRdd.map(row => row ++ Array((row(1).toLong * row(199).toLong).toString)) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22735.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Exiting driver main() method...
It used to exit without any problem for me. You can basically check in the driver UI (that runs on port 4040) and see what exactly it's doing. Thanks Best Regards On Fri, May 1, 2015 at 6:22 PM, James Carman ja...@carmanconsulting.com wrote: In all the examples, it seems that the spark application doesn't really do anything special in order to exit. When I run my application, however, the spark-submit script just hangs there at the end. Is there something special I need to do to get that thing to exit normally?
empty jdbc RDD in spark
Hi all! I am trying to read a hana database using the spark JdbcRDD; here is my code: def readFromHana() { val conf = new SparkConf() conf.setAppName("test").setMaster("local") val sc = new SparkContext(conf) val rdd = new JdbcRDD(sc, () => { Class.forName("com.sap.db.jdbc.Driver").newInstance() DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2", "mujadid", "786Xyz123") }, "SELECT * FROM MEMBERS LIMIT ? OFFSET ?", 0, 100, 1, (r: ResultSet) => convert(r) ) println(rdd.count()); sc.stop() } def convert(rs: ResultSet): String = { val rsmd = rs.getMetaData() val numberOfColumns = rsmd.getColumnCount() var i = 1 val row = new StringBuilder while (i <= numberOfColumns) { row.append(rs.getString(i) + ",") i += 1 } row.toString() } The resultant count is 0 Any suggestion? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/empty-jdbc-RDD-in-spark-tp22736.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
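A note on the query shape (a guess based on how JdbcRDD fills in its placeholders, not something confirmed in this thread): JdbcRDD substitutes the lowerBound and upperBound values -- here 0 and 100 -- into the two '?' placeholders of each partition's query, so "LIMIT ? OFFSET ?" can end up as "LIMIT 0 OFFSET 100", which returns no rows. The usual pattern is to range over a numeric key column; a minimal sketch, assuming MEMBERS has a numeric ID column:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => {
    Class.forName("com.sap.db.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2", "mujadid", "786Xyz123")
  },
  "SELECT * FROM MEMBERS WHERE ID >= ? AND ID <= ?",   // ID is an assumed numeric key column
  1, 100, 1,                                           // lowerBound, upperBound, numPartitions
  (r: ResultSet) => convert(r)
)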
Re: Drop a column from the DataFrame.
Just use select() to create a new DataFrame with only the columns you want -- sort of the opposite of a drop, in that you select everything except the one column you don't want. You could even filter that column out on the fly: myDF.select(myDF.columns.filter(_ != "column_you_do_not_want").map(colname => new Column(colname)).toList : _* ) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Drop-a-column-from-the-DataFrame-tp22711p22737.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.logConf with log4j.rootCategory=WARN
It could be. Thanks Best Regards On Fri, May 1, 2015 at 9:11 PM, roy rp...@njit.edu wrote: Hi, I have recently enabled log4j.rootCategory=WARN, console in the spark configuration, but after that spark.logConf=true has become ineffective. So I just want to confirm if this is because of log4j.rootCategory=WARN ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-logConf-with-log4j-rootCategory-WARN-tp22731.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
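For what it's worth (an inference, not confirmed in this thread): spark.logConf makes the SparkContext print the configuration at INFO level, so a WARN root logger will suppress it. A sketch of a log4j.properties that keeps the root at WARN but lets that one logger through, assuming the message comes from the org.apache.spark.SparkContext logger:

log4j.rootCategory=WARN, console
# let the configuration dump (logged at INFO by SparkContext) still appear
log4j.logger.org.apache.spark.SparkContext=INFO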
Re: Spark Streaming Kafka Avro NPE on deserialization of payload
There was a similar discussion over here http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E Thanks Best Regards On Fri, May 1, 2015 at 7:12 PM, Todd Nist tsind...@gmail.com wrote: *Resending as I do not see that this made it to the mailing list; sorry if in fact it did and it is just not reflected online yet.* I'm very perplexed by the following. I have a set of AVRO generated objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming job follows the receiver-based approach. I am encountering the below error when I attempt to deserialize the payload: 15/04/30 17:49:25 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 9 to sparkExecutor@192.168.1.3:61051 15/04/30 17:49:25 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 9 is 140 bytes 15/04/30 17:49:25 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: relations (com.opsdatastore.model.ObjectDetails) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.avro.generic.GenericData$Array.add(GenericData.java:200) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 17 more 15/04/30 17:49:25 INFO TaskSchedulerImpl: Removed TaskSet 20.0, whose tasks have all completed, from pool Basic code looks like this.
Register the class with Kryo as follows: val sc = new SparkConf(true) .set("spark.streaming.unpersist", "true") .setAppName("StreamingKafkaConsumer") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // register all related AVRO generated classes sc.registerKryoClasses(Array( classOf[ConfigurationProperty], classOf[Event], classOf[Identifier], classOf[Metric], classOf[ObjectDetails], classOf[Relation], classOf[RelationProperty] )) Use the receiver-based approach to consume messages from Kafka: val messages = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics, storageLevel) Now process the received messages: val raw = messages.map(_._2) val dStream = raw.map( byte => { // Avro Decoder println("Byte length: " + byte.length) val decoder = new AvroDecoder[ObjectDetails](schema = ObjectDetails.getClassSchema) val message = decoder.fromBytes(byte) println(s"AvroMessage : Type : ${message.getType}, Payload : $message") message } ) When I look in the logs of the workers, in standard out I can see the messages being printed; in fact I'm even able to access the Type field without issue: Byte length: 315 AvroMessage : Type : Storage, Payload : {name: Storage 1, type: Storage, vendor: 6274g51cbkmkqisk, model: lk95hqk9m10btaot, timestamp: 1430428565141, identifiers: {ID: {name: ID, value: Storage-1}}, configuration: null, metrics: {Disk Space Usage (GB): {name: Disk Space Usage
Re: Spark worker error on standalone cluster
Thanks Akhil, I am trying to investigate this path. The spark is the same, but maybe there is a difference in Hadoop. On Sat, May 2, 2015 at 6:25 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Just make sure you are having the same version of spark in your cluster and the project's build file. Thanks Best Regards On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) mich...@totango.com wrote: Hi everyone, I have a spark application that works fine on a standalone Spark cluster that runs on my laptop (master and one worker), but fails when I try to run it on a standalone Spark cluster deployed on EC2 (master and worker are on different machines). The application structure goes in the following way: There is a java process ('message processor') that runs on the same machine as the Spark master. When it starts, it submits itself to the Spark master; then it listens on SQS and on each received message it should run a spark job to process a file from S3, whose address is configured in the message. It looks like all this fails at the point where the Spark driver tries to send the job to the Spark executor. Below is the code from the 'message processor' that configures the SparkContext, then the Spark driver log, and then the Spark executor log. The outputs of my code and some important points are marked in bold and I've simplified the code and logs in some places for the sake of readability. Would appreciate your help very much, because I've run out of ideas with this problem. 'message processor' code: === logger.info("Started Integration Hub SubmitDriver in test mode."); SparkConf sparkConf = new SparkConf() .setMaster(SPARK_MASTER_URI) .setAppName(APPLICATION_NAME) .setSparkHome(SPARK_LOCATION_ON_EC2_MACHINE); sparkConf.setJars(JavaSparkContext.jarOfClass(this.getClass())); // configure spark executor to use log4j properties located in the local spark conf dir sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties"); sparkConf.set("spark.executor.memory", "1g"); sparkConf.set("spark.cores.max", "3"); // Spill shuffle to disk to avoid OutOfMemory, at cost of reduced performance sparkConf.set("spark.shuffle.spill", "true"); logger.info("Connecting Spark"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_KEY); sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET); logger.info("Spark connected"); === Driver log: === 2015-05-01 07:47:14 INFO ClassPathBeanDefinitionScanner:239 - JSR-330 'javax.inject.Named' annotation found and supported for component scanning 2015-05-01 07:47:14 INFO AnnotationConfigApplicationContext:510 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@5540b23b : startup date [Fri May 01 07:47:14 UTC 2015]; root of context hierarchy 2015-05-01 07:47:14 INFO AutowiredAnnotationBeanPostProcessor:140 - JSR-330 'javax.inject.Inject' annotation found and supported for autowiring 2015-05-01 07:47:14 INFO DefaultListableBeanFactory:596 - Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@13f948e : defining beans
[org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,integrationHubConfig,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,processorInlineDriver,s3Accessor,cdFetchUtil,httpUtil,cdPushUtil,submitDriver,databaseLogger,connectorUtil,totangoDataValidations,environmentConfig,sesUtil,processorExecutor,processorDriver]; root of factory hierarchy *2015-05-01 07:47:15 INFO SubmitDriver:69 - Started Integration Hub SubmitDriver in test mode. 2015-05-01 07:47:15 INFO SubmitDriver:101 - Connecting Spark *2015-05-01 07:47:15 INFO SparkContext:59 - Running Spark version 1.3.0 2015-05-01 07:47:16 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing view acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing modify acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls
spark filestream problem
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them, but that leads to a well-known problem with .tmp files which get removed and make the spark streaming filestream throw an exception
sparkR equivalent to SparkContext.newAPIHadoopRDD?
Hi gang, I'm giving SparkR a test drive and am bummed to discover that the SparkContext API in SparkR is only a subset of what's available in stock Spark. Specifically, I need to be able to pull data from Accumulo into SparkR. I can do it with stock Spark but can't figure out how to make the magic happen with SparkR. Anyone got any ideas? thanks! DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com
Problem in Standalone Mode
When I run my program with spark-submit everything is OK, but when I try to run it in standalone mode I get the following exception: ((This is with val df = sqlContext.jsonFile("./datos.json") )) java.io.EOFException [error] at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744) This is the SparkConf: val sparkConf = new SparkConf().setAppName("myApp") .setMaster("spark://master:7077") .setSparkHome("/usr/local/spark/") .setJars(Seq("./target/scala-2.10/myApp.jar")) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-in-Standalone-Mode-tp22741.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Help with publishing to Kafka from Spark Streaming?
Here is the pull request, you may refer to this: https://github.com/apache/spark/pull/2994 Thanks Jerry 2015-05-01 14:38 GMT+08:00 Pavan Sudheendra pavan0...@gmail.com: Link to the question: http://stackoverflow.com/questions/29974017/spark-kafka-producer-not-serializable-exception Thanks for any pointers.
Re: how to pass configuration properties from driver to executor?
In fact, sparkConf.set("spark.whateverPropertyYouWant", "Value") gets shipped to the executors. Thanks Best Regards On Fri, May 1, 2015 at 2:55 PM, Michael Ryabtsev mich...@totango.com wrote: Hi, We've had a similar problem, but with a log4j properties file. The only working way we've found was externally deploying the properties file on the worker machine to the spark conf folder and configuring the executor jvm options with: sparkConf.set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties"); It is done in our case from the java code, but I think it can also be a parameter to spark-submit. Regards, Michael. On Fri, May 1, 2015 at 2:26 AM, Tian Zhang tzhang...@yahoo.com wrote: Hi, We have a scenario as below and would like your suggestion. We have an app.conf file with propX=A as default, built into the fat jar file that is provided to spark-submit. We have an env.conf file with propX=B that we would like spark-submit to take as input to overwrite the default and populate to both driver and executors. Note in the executor we are using a package that uses typesafe config to read configuration properties. How do we do that? Thanks. Tian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-pass-configuration-properties-from-driver-to-executor-tp22728.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
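For the Typesafe Config case Tian describes, one approach (a sketch under assumptions, not a tested recipe) is to ship env.conf to the executors with --files and point the config library at it through -Dconfig.file, which Typesafe Config honors:

val sparkConf = new SparkConf()
  // env.conf is distributed to each executor's working directory via "--files env.conf" on spark-submit
  .set("spark.executor.extraJavaOptions", "-Dconfig.file=env.conf")
  // the driver reads its own local copy; the path here is a placeholder
  .set("spark.driver.extraJavaOptions", "-Dconfig.file=/local/path/env.conf")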
[PSA] Use Stack Overflow!
This mailing list sees a lot of traffic every day. With such a volume of mail, you may find it hard to find discussions you are interested in, and if you are the one starting discussions you may sometimes feel your mail is going into a black hole. We can't change the nature of this mailing list (it is required by the Apache foundation), but there is an alternative in Stack Overflow http://stackoverflow.com/. Stack Overflow has an active Apache Spark tag http://stackoverflow.com/questions/tagged/apache-spark, and Spark committers like Sean Owen http://stackoverflow.com/users/64174/sean-owen and Josh Rosen http://stackoverflow.com/users/590203/josh-rosen are active under that tag, along with several contributors and of course regular users. Try it out! I think when your question fits Stack Overflow's guidelines http://stackoverflow.com/help/on-topic, it is generally better to ask it there than on this mailing list. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PSA-Use-Stack-Overflow-tp22732.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Generating version agnostic jar path value for --jars clause
I have a list of cloudera jars which I need to provide in the --jars clause, mainly for the HiveContext functionality I am using. However, many of these jars have a version number as part of their names. This leads to an issue that the names might change when I do a Cloudera upgrade. Just a note here, there are many jars which cloudera exposes as a symlink which is the link to the latest version of that jar (e.g. /opt/cloudera/parcels/CDH/lib/parquet/parquet-hadoop-bundle.jar -> parquet-hadoop-bundle-1.5.0-cdh5.3.2.jar), in which case it's good, but there are many jars which aren't. Is there a flexible way to avoid this situation? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Generating-version-agnostic-jar-path-value-for-jars-clause-tp22734.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
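One way around hard-coded versions (a sketch, assuming a parcel layout like the one above; the hive lib path is an example, not taken from the question) is to build the --jars value from a shell glob at submit time, so the versioned names are resolved on the current host:

JARS=$(ls /opt/cloudera/parcels/CDH/lib/parquet/parquet-hadoop-bundle*.jar \
          /opt/cloudera/parcels/CDH/lib/hive/lib/datanucleus-*.jar | paste -sd, -)
spark-submit --jars "$JARS" ...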
Re: Spark worker error on standalone cluster
Just make sure you are having the same version of spark in your cluster and the project's build file. Thanks Best Regards On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) mich...@totango.com wrote: Hi everyone, I have a spark application that works fine on a standalone Spark cluster that runs on my laptop (master and one worker), but fails when I try to run it on a standalone Spark cluster deployed on EC2 (master and worker are on different machines). The application structure goes in the following way: There is a java process ('message processor') that runs on the same machine as the Spark master. When it starts, it submits itself to the Spark master; then it listens on SQS and on each received message it should run a spark job to process a file from S3, whose address is configured in the message. It looks like all this fails at the point where the Spark driver tries to send the job to the Spark executor. Below is the code from the 'message processor' that configures the SparkContext, then the Spark driver log, and then the Spark executor log. The outputs of my code and some important points are marked in bold and I've simplified the code and logs in some places for the sake of readability. Would appreciate your help very much, because I've run out of ideas with this problem. 'message processor' code: === logger.info("Started Integration Hub SubmitDriver in test mode."); SparkConf sparkConf = new SparkConf() .setMaster(SPARK_MASTER_URI) .setAppName(APPLICATION_NAME) .setSparkHome(SPARK_LOCATION_ON_EC2_MACHINE); sparkConf.setJars(JavaSparkContext.jarOfClass(this.getClass())); // configure spark executor to use log4j properties located in the local spark conf dir sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties"); sparkConf.set("spark.executor.memory", "1g"); sparkConf.set("spark.cores.max", "3"); // Spill shuffle to disk to avoid OutOfMemory, at cost of reduced performance sparkConf.set("spark.shuffle.spill", "true"); logger.info("Connecting Spark"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_KEY); sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET); logger.info("Spark connected"); === Driver log: === 2015-05-01 07:47:14 INFO ClassPathBeanDefinitionScanner:239 - JSR-330 'javax.inject.Named' annotation found and supported for component scanning 2015-05-01 07:47:14 INFO AnnotationConfigApplicationContext:510 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@5540b23b : startup date [Fri May 01 07:47:14 UTC 2015]; root of context hierarchy 2015-05-01 07:47:14 INFO AutowiredAnnotationBeanPostProcessor:140 - JSR-330 'javax.inject.Inject' annotation found and supported for autowiring 2015-05-01 07:47:14 INFO DefaultListableBeanFactory:596 - Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@13f948e : defining beans
[org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,integrationHubConfig,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,processorInlineDriver,s3Accessor,cdFetchUtil,httpUtil,cdPushUtil,submitDriver,databaseLogger,connectorUtil,totangoDataValidations,environmentConfig,sesUtil,processorExecutor,processorDriver]; root of factory hierarchy *2015-05-01 07:47:15 INFO SubmitDriver:69 - Started Integration Hub SubmitDriver in test mode. 2015-05-01 07:47:15 INFO SubmitDriver:101 - Connecting Spark *2015-05-01 07:47:15 INFO SparkContext:59 - Running Spark version 1.3.0 2015-05-01 07:47:16 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing view acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing modify acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 2015-05-01 07:47:18 INFO Slf4jLogger:80 - Slf4jLogger started 2015-05-01 07:47:18 INFO Remoting:74 -
spark filestream problem
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them, but that leads to a well-known problem with .tmp files which get removed and make the spark streaming filestream throw an exception -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem-tp22742.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Enabling Event Log
is it working now? On 1 May 2015 at 13:43, James King jakwebin...@gmail.com wrote: Oops! well spotted. Many thanks Shixiong. On Fri, May 1, 2015 at 1:25 AM, Shixiong Zhu zsxw...@gmail.com wrote: spark.history.fs.logDirectory is for the history server. For Spark applications, they should use spark.eventLog.dir. Since you commented out spark.eventLog.dir, it will be /tmp/spark-events. And this folder does not exist. Best Regards, Shixiong Zhu 2015-04-29 23:22 GMT-07:00 James King jakwebin...@gmail.com: I'm unclear why I'm getting this exception. It seems to have realized that I want to enable Event Logging but to be ignoring where I want it to log to, i.e. file:/opt/cb/tmp/spark-events, which does exist. spark-defaults.conf # Example: spark.master spark://master1:7077,master2:7077 spark.eventLog.enabled true spark.history.fs.logDirectory file:/opt/cb/tmp/spark-events # spark.eventLog.dir hdfs://namenode:8021/directory # spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers=one two three Exception following job submission: spark.eventLog.enabled=true spark.history.fs.logDirectory=file:/opt/cb/tmp/spark-events spark.jars=file:/opt/cb/scripts/spark-streamer/cb-spark-streamer-1.0-SNAPSHOT.jar spark.master=spark://master1:7077,master2:7077 Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99) at org.apache.spark.SparkContext.init(SparkContext.scala:399) at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642) at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:75) at org.apache.spark.streaming.api.java.JavaStreamingContext.init(JavaStreamingContext.scala:132) Many Thanks jk
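Putting Shixiong's point back into the file from the first message, the relevant spark-defaults.conf lines would look something like the sketch below (the paths are the ones from this thread; keeping spark.history.fs.logDirectory as well only matters if a history server reads the same directory):

spark.eventLog.enabled           true
spark.eventLog.dir               file:/opt/cb/tmp/spark-events
spark.history.fs.logDirectory    file:/opt/cb/tmp/spark-events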
to split an RDD to multiple ones?
Hi, I have an RDD srdd containing (unordered-)data like this: s1_0, s3_0, s2_1, s2_2, s3_1, s1_3, s1_2, … What I want is (it will be much better if they could be in ascending order): srdd_s1: s1_0, s1_1, s1_2, …, s1_n srdd_s2: s2_0, s2_1, s2_2, …, s2_n srdd_s3: s3_0, s3_1, s3_2, …, s3_n … … Have any idea? Thanks in advance! :) Best, Yifan LI
Re: DataFrame filter referencing error
First of all, thank you for your replies. I was previously doing this via a normal jdbc connection and it worked without problems. Then I liked the idea that sparksql could take care of opening/closing the connection. I tried also with single quotes, since that was my first guess, but it didn't work. I fear I will have to look at the spark code but I'm the only one with this issue. BTW I'm testing with spark 1.3.0 Best, Francesco On Fri, May 1, 2015, 00:54 ayan guha guha.a...@gmail.com wrote: I think you need to specify new in single quotes. My guess is the query showing up in the DB is like ...where status=new. In either case mysql assumes new is a column. What you need is the form below ...where status='new' You need to provide your quotes accordingly. Easiest way would be to do it in a separate jdbc conn to mysql using a simple standalone programme, not in spark. On 1 May 2015 07:47, Burak Yavuz brk...@gmail.com wrote: Is new a reserved word for MySQL? On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella francesco.bigare...@gmail.com wrote: Do you know how I can check that? I googled a bit but couldn't find a clear explanation about it. I also tried to use explain() but it doesn't really help. I still find it unusual that I have this issue only for the equality operator but not for the others. Thank you, F On Wed, Apr 29, 2015 at 3:03 PM ayan guha guha.a...@gmail.com wrote: Looks like your DF is based on a MySQL DB using jdbc, and the error is thrown from mySQL. Can you see what SQL is finally getting fired in MySQL? Spark is pushing down the predicate to mysql so it's not a spark problem per se On Wed, Apr 29, 2015 at 9:56 PM, Francesco Bigarella francesco.bigare...@gmail.com wrote: Hi all, I was testing the DataFrame filter functionality and I found what I think is a strange behaviour. My dataframe testDF, obtained loading a MySQL table via jdbc, has the following schema: root |-- id: long (nullable = false) |-- title: string (nullable = true) |-- value: string (nullable = false) |-- status: string (nullable = false) What I want to do is filter my dataset to obtain all rows that have a status = "new". scala> testDF.filter(testDF("id") === 1234).first() works fine (also with the integer value within double quotes), however if I try to use the same statement to filter on the status column (also with changes in the syntax - see below), suddenly the program breaks.
Any of the following: scala> testDF.filter(testDF("status") === "new") scala> testDF.filter("status = 'new'") scala> testDF.filter($"status" === "new") generates the error: INFO scheduler.DAGScheduler: Job 3 failed: runJob at SparkPlan.scala:121, took 0.277907 s org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 12, node name): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'new' in 'where clause' at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at com.mysql.jdbc.Util.handleNewInstance(Util.java:411) at com.mysql.jdbc.Util.getInstance(Util.java:386) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3597) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3529) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2625) at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2119) at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2283) at org.apache.spark.sql.jdbc.JDBCRDD$anon$1.init(JDBCRDD.scala:328) at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
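One thing that may be worth trying (an assumption on my part, not something anyone in the thread confirmed) is building the literal explicitly with lit(), so it cannot end up in the generated SQL as a bare identifier:

import org.apache.spark.sql.functions.lit
testDF.filter(testDF("status") === lit("new")).first()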
Re: empty jdbc RDD in spark
bq. "SELECT * FROM MEMBERS LIMIT ? OFFSET ?", Have you tried dropping the LIMIT and OFFSET clause from the above query ? Cheers On Fri, May 1, 2015 at 1:56 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi all! I am trying to read a hana database using the spark JdbcRDD; here is my code: def readFromHana() { val conf = new SparkConf() conf.setAppName("test").setMaster("local") val sc = new SparkContext(conf) val rdd = new JdbcRDD(sc, () => { Class.forName("com.sap.db.jdbc.Driver").newInstance() DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2", "mujadid", "786Xyz123") }, "SELECT * FROM MEMBERS LIMIT ? OFFSET ?", 0, 100, 1, (r: ResultSet) => convert(r) ) println(rdd.count()); sc.stop() } def convert(rs: ResultSet): String = { val rsmd = rs.getMetaData() val numberOfColumns = rsmd.getColumnCount() var i = 1 val row = new StringBuilder while (i <= numberOfColumns) { row.append(rs.getString(i) + ",") i += 1 } row.toString() } The resultant count is 0 Any suggestion? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/empty-jdbc-RDD-in-spark-tp22736.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Number of input partitions in SparkContext.sequenceFile
Hi, How did you check the number of splits in your file? Did you run your MR job or calculate it? The formula for split size is max(minSize, min(maxSize, blockSize)). Can you check if it explains your case? Thanks Regards, Archit Thakur. On Saturday, April 25, 2015, Wenlei Xie wenlei@gmail.com wrote: Hi, I checked the number of partitions by System.out.println("INFO: RDD with " + rdd.partitions().size() + " partitions created."); Each single split is about 100MB. I am currently loading the data from the local file system, would this explain this observation? Thank you! Best, Wenlei On Tue, Apr 21, 2015 at 6:28 AM, Archit Thakur archit279tha...@gmail.com wrote: Hi, It should generate the same number of partitions as the number of splits. How did you check the number of partitions? Also please paste your file size and hdfs-site.xml and mapred-site.xml here. Thanks and Regards, Archit Thakur. On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering about the mechanism that determines the number of partitions created by SparkContext.sequenceFile ? For example, although my file has only 4 splits, Spark would create 16 partitions for it. Is it determined by the file size? Is there any way to control it? (Looks like I can only tune minPartitions but not maxPartitions) Thank you! Best, Wenlei -- Wenlei Xie (谢文磊) Ph.D. Candidate Department of Computer Science 456 Gates Hall, Cornell University Ithaca, NY 14853, USA Email: wenlei@gmail.com
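If the goal is to cap the number of partitions, one knob (an assumption following from the split-size formula above, not something verified against Wenlei's setup) is to raise the minimum split size seen by the old-API Hadoop input format before calling sequenceFile, so FileInputFormat produces fewer, larger splits:

// 256 MB is an arbitrary example value; key/value types and the path are placeholders
sc.hadoopConfiguration.set("mapred.min.split.size", (256L * 1024 * 1024).toString)
val rdd = sc.sequenceFile[String, String]("/path/to/data")
println(rdd.partitions.size)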
Re: How to add a column to a spark RDD with many columns?
Thanks for your reply! It is what I am after. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: not getting any mail
Looks like there were delays across Apache project mailing lists. Emails are coming through now. On May 2, 2015, at 9:14 AM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All I am not getting any mail from this community? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: spark filestream problem
I have figured it out in the meantime - simply, when moving a file on HDFS it preserves its time stamp, and on the other hand the spark filestream adapter seems to care not so much about filenames as about timestamps - hence NEW files with OLD time stamps will NOT be processed - yuk. The hack you can use is to a) copy the required file to a temp location and then b) move it from there to the dir monitored by spark filestream - this will ensure it has a recent timestamp -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Saturday, May 2, 2015 5:09 PM To: user@spark.apache.org Subject: spark filestream problem it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them but that leads to a well-known problem with .tmp files which get removed and make the spark streaming filestream throw an exception -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem -tp22743.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
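A concrete form of that copy-then-move hack might look like the following (the paths are placeholders; the copy stamps the file with a fresh modification time, and the subsequent move/rename within HDFS preserves it):

hdfs dfs -cp /data/incoming/events.csv /tmp/staging/events.csv
hdfs dfs -mv /tmp/staging/events.csv /watched/dir/events.csv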
Submit & Kill Spark Application programmatically from another application
Hi, I am wondering if it is possible to submit, monitor & kill spark applications from another service. I have written a service that does this: parse user commands; translate them into understandable arguments to an already prepared Spark-SQL application; submit the application along with arguments to the Spark cluster using spark-submit from ProcessBuilder; run the generated application's driver in cluster mode. The above 4 steps have been finished, but I have difficulties with these two: query the application's status, for example, the percentage of completion; kill queries accordingly. What I find in the spark standalone documentation suggests killing an application using: ./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID> and that I should find the driver ID through the standalone Master web UI at http://<master url>:8080. Are there any programmatic methods I could use to get the driver ID submitted by my `ProcessBuilder` and query the status of that query? Any Suggestions? — Best Regards! Yijie Shen
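One option that may fit (hedged — this depends on the Spark version and on the standalone REST submission gateway being available) is the --status and --kill flags of spark-submit against the master's REST port, using the submission ID returned when the driver was submitted in cluster mode:

./bin/spark-submit --master spark://master-host:6066 --status driver-20150501123456-0001
./bin/spark-submit --master spark://master-host:6066 --kill driver-20150501123456-0001

The host and submission/driver ID here are made-up examples; the service would need to capture the real ID from the submission output of ProcessBuilder.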
Re: Can I group elements in RDD into different groups and let each group share some elements?
Did you look at the cogroup transformation or the cartesian transformation ? Regards, Olivier. On Sat, 2 May 2015 at 22:01, Franz Chien franzj...@gmail.com wrote: Hi all, Can I group elements in an RDD into different groups and let each group share elements? For example, I have 10,000 elements in an RDD from e1 to e1, and I want to group and aggregate them by another mapping with size of 2000, ex: ( (e1,e42), (e1,e554), (e3, e554)…… (2000th group)). My first approach was to filter the RDD with mapping rules 2000 times, and then union them together. However, it ran forever. Does SPARK provide a way to group elements in an RDD like this please? Thanks, Franz
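To make the join/cogroup suggestion concrete, here is a rough sketch (the structure of the group mapping is assumed to be (element id, group id) pairs, since the original post doesn't spell it out). It replaces the 2000 separate filter passes with a single join, and an element listed under several groups is naturally duplicated into each of them:

// elements: RDD[(String, Element)] keyed by element id, e.g. ("e42", e42)
// mapping:  RDD[(String, String)] of (element id, group id), e.g. ("e42", "g1")
val grouped = elements
  .join(mapping)                                        // (elementId, (element, groupId))
  .map { case (_, (elem, groupId)) => (groupId, elem) } // re-key by group
  .groupByKey()                                         // (groupId, Iterable[Element])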
Re: real time Query engine Spark-SQL on Hbase
In the upcoming 1.4.0 release, SPARK-3468 should give you a better clue. Cheers On Fri, May 1, 2015 at 12:30 PM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, Thanks for the reply. Hbase cli takes less than 500 ms for the same query. I am running a simple query, i.e. Select * from Customers where c_id='123123'. Why would the same query which takes 500 ms at the Hbase cli end up taking around 8 secs via Spark-Sql? I am unable to understand this. Thanks, Siddharth -- *From:* ayan guha guha.a...@gmail.com *Sent:* 01 May 2015 04:38 *To:* Ted Yu *Cc:* user@spark.apache.org; Siddharth Ubale; matei.zaha...@gmail.com; Prakash Hosalli; Amit Kumar *Subject:* Re: real time Query engine Spark-SQL on Hbase And if I may ask, how long does it take in the hbase CLI? I would not expect spark to improve performance of hbase. At best spark will push down the filter to hbase. So I would try to optimise any additional overhead like bringing data into spark. On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote: bq. a single query on one filter criteria Can you tell us more about your filter ? How selective is it ? Which hbase release are you using ? Cheers On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, I want to use Spark as a query engine on HBase with sub-second latency. I am using Spark 1.3 version. And followed the steps below on an HBase table with around 3.5 lakh (350,000) rows: 1. Mapped the DataFrame to the HBase table. RDDCustomers maps to the HBase table which is used to create the DataFrame. "DataFrame schemaCustomers = sqlInstance.createDataFrame(SparkContextImpl.getRddCustomers(), Customers.class);" 2. Used registerTempTable, i.e. schemaCustomers.registerTempTable("customers"); 3. Running the query on the DataFrame using the SQLContext instance. What I am observing is that for a single query on one filter criterion the query is taking 7-8 seconds, and the time increases as I increase the number of rows in the HBase table. Also, there was one time when I was getting query responses under 1-2 seconds. Seems like strange behavior. Is this expected behavior from Spark or am I missing something here? Can somebody help me understand this scenario. Please assist. Thanks, Siddharth Ubale,
Re: ClassNotFoundException for Kryo serialization
Now I am running up against some other problem while trying to schedule tasks: 15/05/01 22:32:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2419) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) I verified that the same configuration works without using Kryo serialization. On Fri, May 1, 2015 at 9:44 AM, Akshat Aranya aara...@gmail.com wrote: I cherry-picked the fix for SPARK-5470 and the problem has gone away. On Fri, May 1, 2015 at 9:15 AM, Akshat Aranya aara...@gmail.com wrote: Yes, this class is present in the jar that was loaded in the classpath of the executor Java process -- it wasn't even lazily added as a part of the task execution. Schema$MyRow is a protobuf-generated class. After doing some digging around, I think I might be hitting up against SPARK-5470, the fix for which hasn't been merged into 1.2, as far as I can tell. On Fri, May 1, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote: bq. Caused by: java.lang.ClassNotFoundException: com.example.Schema$MyRow So the above class is in the jar which was in the classpath ? Can you tell us a bit more about Schema$MyRow ? 
On Fri, May 1, 2015 at 8:05 AM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm getting a ClassNotFoundException at the executor when trying to register a class for Kryo serialization: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:243) at org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:254) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:257) at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:182) at org.apache.spark.executor.Executor.init(Executor.scala:87) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:61) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:36) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Failed to load class to register with Kryo at
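For reference, a minimal sketch (not from this thread) of enabling Kryo and registering an application class by name through SparkConf; com.example.Schema$MyRow simply mirrors the class mentioned above and is illustrative only, and the jar containing it still has to reach the executors (e.g. via --jars / spark.jars) for the registration to succeed there.

import org.apache.spark.{SparkConf, SparkContext}

// Enable Kryo and register the class up front; the class name is a placeholder.
val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.classesToRegister", "com.example.Schema$MyRow")

val sc = new SparkContext(conf)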
Re: to split an RDD to multiple ones?
I guess:

val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(identity)
val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(identity)
val srdd_s3 = srdd.filter(_.startsWith("s3_")).sortBy(identity)

Regards, Olivier.

On Sat, 2 May 2015 at 22:53, Yifan LI iamyifa...@gmail.com wrote: Hi, I have an RDD srdd containing (unordered) data like this: s1_0, s3_0, s2_1, s2_2, s3_1, s1_3, s1_2, … What I want is (it will be much better if they could be in ascending order): srdd_s1: s1_0, s1_1, s1_2, …, s1_n srdd_s2: s2_0, s2_1, s2_2, …, s2_n srdd_s3: s3_0, s3_1, s3_2, …, s3_n … … Any idea? Thanks in advance! :) Best, Yifan LI
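For completeness, a self-contained sketch of that filter-per-prefix approach, assuming a local SparkContext and elements shaped like "prefix_number"; sorting on the numeric suffix keeps s1_2 ahead of s1_10.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-rdd-sketch").setMaster("local[*]"))
val srdd = sc.parallelize(Seq("s1_0", "s3_0", "s2_1", "s2_2", "s3_1", "s1_3", "s1_2"))

// One filtered RDD per prefix, sorted ascending by the numeric suffix.
def slice(prefix: String) =
  srdd.filter(_.startsWith(prefix + "_")).sortBy(_.split("_")(1).toInt)

val srdd_s1 = slice("s1")
val srdd_s2 = slice("s2")
val srdd_s3 = slice("s3")

srdd_s1.collect().foreach(println) // s1_0, s1_2, s1_3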
Re: Drop a column from the DataFrame.
Sounds like a patch for a drop method...

On Sat, 2 May 2015 at 21:03, dsgriffin dsgrif...@gmail.com wrote: Just use select() to create a new DataFrame with only the columns you want -- sort of the opposite of what you asked for, but you can select all of the columns minus the one you don't want. You could even use a filter to drop just that one column on the fly:

myDF.select(myDF.columns.filter(_ != "column_you_do_not_want").map(colname => new Column(colname)).toList: _*)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Drop-a-column-from-the-DataFrame-tp22711p22737.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:
Can you post your code? Otherwise there's not much we can do. Regards, Olivier.

On Sat, 2 May 2015 at 21:15, shahab shahab.mok...@gmail.com wrote: Hi, I am using Spark 1.2.0 with Kryo serialization, but I get the following exception: java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I would appreciate it if anyone could tell me how I can resolve this. best, /Shahab
Re: Drop a column from the DataFrame.
This is coming in 1.4.0 https://issues.apache.org/jira/browse/SPARK-7280 On May 2, 2015, at 2:27 PM, Olivier Girardot ssab...@gmail.com wrote: Sounds like a patch for a drop method...
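Putting the two replies together, a hedged sketch for Spark 1.3.x, where myDF and the column name are placeholders; on 1.4.0+ the same thing should reduce to myDF.drop("colToRemove") per SPARK-7280.

import org.apache.spark.sql.Column

// Select every column except the unwanted one -- the 1.3.x workaround described above.
val trimmed = myDF.select(
  myDF.columns
    .filter(_ != "colToRemove")
    .map(name => new Column(name)): _*)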
Re: to split an RDD to multiple ones?
Thanks, Olivier and Franz. :) Best, Yifan LI
Re: Spark - Timeout Issues - OutOfMemoryError
You could try repartitioning your listings RDD. Also, doing a collectAsMap would basically bring all your data to the driver; in that case you might want to set the storage level to memory-and-disk, though I'm not sure that will help on the driver side. Thanks Best Regards

On Thu, Apr 30, 2015 at 11:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Full exception:

15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at VISummaryDataProvider.scala:37) failed in 884.087 s 15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap at VISummaryDataProvider.scala:37, took 1093.418249 s 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)])

Code at line 37:

val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }.collectAsMap

The listing data set size is 26G (10 files) and my driver memory is 12G (I can't go beyond it). The reason I do collectAsMap is to broadcast it and do a map-side join instead of a regular join. Please suggest?
On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: My Spark job is failing and I see:

15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

I see multiple of these: Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

And finally I see this: java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:95) at
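For what it's worth, a minimal sketch (not from this thread) of the broadcast map-side join being attempted; listings and getItemId come from the code above, while events and its itemId field are made-up placeholders. Note that collectAsMap still has to fit the whole map into the 12G driver, which lines up with the OutOfMemoryError quoted here.

// Broadcast the collected map and look items up on the map side instead of shuffling.
val lstgItemMap = listings
  .map(lstg => (lstg.getItemId().toLong, lstg))
  .collectAsMap() // everything lands on the driver -- must fit in driver memory

val lstgBroadcast = sc.broadcast(lstgItemMap)

val joined = events.map { ev =>
  (ev, lstgBroadcast.value.get(ev.itemId)) // map-side lookup, no shuffle join
}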
Can I group elements in RDD into different groups and let each group share some elements?
Hi all, Can I group elements in an RDD into different groups and let each group share elements? For example, I have 10,000 elements in an RDD, from e1 to e10000, and I want to group and aggregate them by another mapping of size 2,000, e.g. ((e1,e42), (e1,e554), (e3,e554) …… (2000th group)). My first approach was to filter the RDD with the mapping rules 2,000 times and then union the results together. However, it ran forever. Does Spark provide a way to group elements in an RDD like this, please? Thanks, Franz
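One way to avoid 2,000 separate filter-and-union jobs is to broadcast the mapping and emit (groupId, element) pairs in a single pass. The following is only a hedged sketch with made-up group and element names, not something taken from the thread.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("group-by-mapping-sketch").setMaster("local[*]"))

// Hypothetical mapping: group id -> the element ids it contains (an element may belong to several groups).
val groups: Map[Int, Set[String]] =
  Map(1 -> Set("e1", "e42"), 2 -> Set("e1", "e554"), 3 -> Set("e3", "e554"))

// Invert the mapping once on the driver (element -> group ids) and broadcast it.
val elementToGroups = sc.broadcast(
  groups.toSeq
    .flatMap { case (g, es) => es.map(e => (e, g)) }
    .groupBy(_._1)
    .map { case (e, pairs) => (e, pairs.map(_._2)) })

val elements = sc.parallelize(Seq("e1", "e3", "e42", "e554"))

// Single pass: tag every element with its group ids, then gather the members of each group.
val grouped = elements
  .flatMap(e => elementToGroups.value.getOrElse(e, Seq.empty).map(g => (g, e)))
  .groupByKey()

grouped.collect().foreach { case (g, es) => println(s"group $g: ${es.mkString(", ")}") }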
Re: directory loader in windows
Thanks for the answer. I am now trying to set HADOOP_HOME but the issue still persists. Also, I can see only windows-utils.exe in my HADOOP_HOME, but no WINUTILS.EXE. I do not have Hadoop installed on my system, as I am not using HDFS, but I am using Spark 1.3.1 prebuilt with Hadoop 2.6. Am I missing something? Best Ayan

On Tue, Apr 28, 2015 at 12:45 AM, Steve Loughran ste...@hortonworks.com wrote: This is a Hadoop-side stack trace: it looks like the code is trying to get the filesystem permissions by running %HADOOP_HOME%\bin\WINUTILS.EXE ls -F and something is triggering a null pointer exception. There isn't any HADOOP JIRA with this specific stack trace in it, so it's not a known/fixed problem. At a guess, your HADOOP_HOME environment variable isn't pointing to the right place. If that's the case there should have been a warning in the logs.

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at 
java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan Guha
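As an aside, a hedged Scala sketch of the workaround Steve describes (the C:\hadoop path is an assumption): Hadoop's Shell utility reads the hadoop.home.dir system property, falling back to the HADOOP_HOME environment variable, and expects bin\winutils.exe underneath it, so pointing it at a directory that actually contains winutils.exe before any filesystem access avoids the NullPointerException above. From PySpark the equivalent is making sure HADOOP_HOME points at such a directory before launching spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: C:\hadoop\bin\winutils.exe must exist; the path is illustrative.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new SparkContext(new SparkConf().setAppName("windows-dir-loader").setMaster("local[*]"))
val lines = sc.textFile("C:\\data\\logs") // the directory load that previously hit the NPE in Shell
println(lines.count())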