RE: Exiting driver main() method...

2015-05-02 Thread Mohammed Guller
No, you don’t need to do anything special. Perhaps your application is getting 
stuck somewhere? If you can share your code, someone may be able to help.

Mohammed

From: James Carman [mailto:ja...@carmanconsulting.com]
Sent: Friday, May 1, 2015 5:53 AM
To: user@spark.apache.org
Subject: Exiting driver main() method...

In all the examples, it seems that the spark application doesn't really do 
anything special in order to exit.  When I run my application, however, the 
spark-submit script just hangs there at the end.  Is there something special 
I need to do to get that thing to exit normally?
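
For reference, a minimal sketch of a driver main() that exits cleanly (the job itself is just a placeholder); if spark-submit still hangs after something like this, a leftover non-daemon thread in the application is the usual suspect:

import org.apache.spark.{SparkConf, SparkContext}

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("example"))
    try {
      println(sc.parallelize(1 to 1000).count())   // placeholder job
    } finally {
      sc.stop()   // release the context so the JVM can exit and spark-submit returns
    }
  }
}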


Re: real time Query engine Spark-SQL on Hbase

2015-05-02 Thread Siddharth Ubale
Hi,


Thanks for the reply.


Hbase cli takes less than 500 ms for the same query.

I am running a simple query, i.e. Select * from Customers where c_id='123123'.

Why would the same query which takes 500 ms at Hbase cli end up taking around 8 
secs via Spark-Sql?

I am unable to understand this.


Thanks,

Siddharth






From: ayan guha guha.a...@gmail.com
Sent: 01 May 2015 04:38
To: Ted Yu
Cc: user@spark.apache.org; Siddharth Ubale; matei.zaha...@gmail.com; Prakash 
Hosalli; Amit Kumar
Subject: Re: real time Query engine Spark-SQL on Hbase


And if I may ask, how long does it take in the hbase CLI? I would not expect spark to
improve the performance of hbase. At best spark will push down the filter to hbase.
So I would try to optimise any additional overhead like bringing data into 
spark.

On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com
wrote:
bq. a single query on one filter criteria

Can you tell us more about your filter ? How selective is it ?

Which hbase release are you using ?

Cheers

On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale 
siddharth.ub...@syncoms.com wrote:
Hi,

I want to use Spark as Query engine on HBase with sub second latency.

I am using Spark 1.3. I followed the steps below on an Hbase table
with around 3.5 lac (350,000) rows:


1.   Mapped the Dataframe to the Hbase table. RDDCustomers maps to the hbase
table which is used to create the Dataframe:

 DataFrame schemaCustomers = sqlInstance
         .createDataFrame(SparkContextImpl.getRddCustomers(),
                          Customers.class);

2.   Used registerTempTable, i.e.
schemaCustomers.registerTempTable("customers");

3.   Running the query on the Dataframe using the Sqlcontext instance.

What I am observing is that for a single query on one filter criterion the query
is taking 7-8 seconds, and the time increases as I increase the number of
rows in the Hbase table. Also, there was one time when I was getting the query response
under 1-2 seconds. Seems like strange behavior.
Is this expected behavior from Spark or am I missing something here?
Can somebody help me understand this scenario? Please assist.

Thanks,
Siddharth Ubale,




com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-02 Thread shahab
Hi,

I am using spark-1.2.0 and I used Kryo serialization but I get the
following exception.

java.io.IOException: com.esotericsoftware.kryo.KryoException:
java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1

I would appreciate it if anyone could tell me how I can resolve this.

best,
/Shahab
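
For reference, a minimal sketch of typical Kryo setup with explicit class registration (MyRecord is a placeholder; this is general configuration rather than a confirmed fix for the exception above):

import org.apache.spark.SparkConf

case class MyRecord(id: Int, name: String)   // placeholder for the classes being serialized

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registering the concrete classes keeps driver and executors consistent
  .registerKryoClasses(Array(classOf[MyRecord]))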


not getting any mail

2015-05-02 Thread Jeetendra Gangele
Hi All

I am not getting any mail from this community?


Remoting warning when submitting to cluster

2015-05-02 Thread javidelgadillo
Hello all!!

We've been prototyping some spark applications to read messages from Kafka
topics.  The application is quite simple, we use KafkaUtils.createStream to
receive a stream of CSV messages from a Kafka Topic.  We parse the CSV and
count the number of messages we get in each RDD. At a high-level (removing
the abstractions of our application), it looks like this:

val sc = new SparkConf()
  .setAppName(appName)
  .set("spark.executor.memory", "1024m")
  .set("spark.cores.max", "3")
  .set("spark.app.name", appName)
  .set("spark.ui.port", sparkUIPort)

 val ssc = new StreamingContext(sc, Milliseconds(emitInterval.toInt))

KafkaUtils
  .createStream(ssc, zookeeperQuorum, consumerGroup, topicMap)
  .map(_._2)
  .foreachRDD( (rdd: RDD[String], time: Time) => {
    println("Time %s: (%s total records)".format(time, rdd.count()))
  })

When I submit this using a spark master of local[3], everything behaves as
I'd expect.  After some startup overhead, I'm seeing the count printed to be
the same as the count I'm simulating (1 every second, for example).

When I submit this to a spark master using spark://master.host:7077, the
behavior is different.  The overhead to start receiving seems longer, and on
some runs I don't see anything for 30 seconds even though my simulator is
sending messages to the topic.  I also see the following error written to
stderr by every executor assigned to the job:

Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
15/05/01 10:11:38 INFO SecurityManager: Changing view acls to: username
15/05/01 10:11:38 INFO SecurityManager: Changing modify acls to: username
15/05/01 10:11:38 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(javi4211);
users with modify permissions: Set(username)
15/05/01 10:11:38 INFO Slf4jLogger: Slf4jLogger started
15/05/01 10:11:38 INFO Remoting: Starting remoting
15/05/01 10:11:39 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://driverpropsfetc...@master.host:56534]
15/05/01 10:11:39 INFO Utils: Successfully started service
'driverPropsFetcher' on port 56534.
15/05/01 10:11:40 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkdri...@driver.host:51837]. Address is now gated for
5000 ms, all messages to this address will be delivered to dead letters.
Reason: Connection refused: no further information:
driver.host/10.27.51.214:51837
15/05/01 10:12:09 ERROR UserGroupInformation: PriviledgedActionException
as:username cause:java.util.concurrent.TimeoutException: Futures timed out
after [30 seconds]
Exception in thread main java.lang.reflect.UndeclaredThrowableException:
Unknown exception in doAs
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:224)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException:
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)


Is there something else I need to configure to ensure akka remoting will
work correctly when running on a spark cluster?  Or can I ignore this error?

-Javier
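
One thing worth double-checking (a sketch, not a confirmed fix): the warning suggests the executors cannot connect back to the driver at driver.host:51837, so it can help to make the driver advertise an address and port that the workers can actually reach and that the firewall allows. spark.driver.host and spark.driver.port are standard Spark settings; the values below are only examples taken from the log above:

val sc = new SparkConf()
  .setAppName(appName)
  .set("spark.driver.host", "10.27.51.214")   // an address routable from the workers
  .set("spark.driver.port", "51837")          // pin the port so it can be opened in the firewall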



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Remoting-warning-when-submitting-to-cluster-tp22733.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark - Hive Metastore MySQL driver

2015-05-02 Thread Ted Yu
Can you try the patch from:
[SPARK-6913][SQL] Fixed java.sql.SQLException: No suitable driver found

Cheers

On Sat, Mar 28, 2015 at 12:41 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 This is from my Hive installation

 -sh-4.1$ ls /apache/hive/lib  | grep derby

 derby-10.10.1.1.jar

 derbyclient-10.10.1.1.jar

 derbynet-10.10.1.1.jar


 -sh-4.1$ ls /apache/hive/lib  | grep datanucleus

 datanucleus-api-jdo-3.2.6.jar

 datanucleus-core-3.2.10.jar

 datanucleus-rdbms-3.2.9.jar


 -sh-4.1$ ls /apache/hive/lib  | grep mysql

 mysql-connector-java-5.0.8-bin.jar

 -sh-4.1$


 $ hive --version

 Hive 0.13.0.2.1.3.6-2

 Subversion
 git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6
 -r 87da9430050fb9cc429d79d95626d26ea382b96c


 $



 On Sat, Mar 28, 2015 at 1:05 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 I tried with a different version of driver but same error

 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
 --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
 */home/dvasthimal/spark1.3/mysql-connector-java-5.0.8-bin.jar* --files
 $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory 4g
 --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g
 --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2

 On Sat, Mar 28, 2015 at 12:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 This is what I am seeing:



 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
 --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
 --files $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory
 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g
 --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


 Caused by:
 org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException:
 The specified datastore driver (com.mysql.jdbc.Driver) was not found in
 the CLASSPATH. Please check your CLASSPATH specification, and the name of
 the driver.




 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
 --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
 *--files $SPARK_HOME/conf/hive-site.xml  --num-executors 1
 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G
 --executor-memory 2g --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


 Caused by: java.sql.SQLException: No suitable driver found for
 jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB
 at java.sql.DriverManager.getConnection(DriverManager.java:596)


 Looks like the driver jar that I got is not correct.

 On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Could someone please share the spark-submit command that shows their
 mysql jar containing driver class used to connect to Hive MySQL meta store.

 Even after including it through

 

Re: How to add a column to a spark RDD with many columns?

2015-05-02 Thread dsgriffin
val newRdd = myRdd.map(row => row ++ Array((row(1).toLong *
row(199).toLong).toString))



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22735.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exiting driver main() method...

2015-05-02 Thread Akhil Das
It used to exit without any problem for me. You can basically check the
driver UI (that runs on 4040) and see what exactly it's doing.

Thanks
Best Regards

On Fri, May 1, 2015 at 6:22 PM, James Carman ja...@carmanconsulting.com
wrote:

 In all the examples, it seems that the spark application doesn't really do
 anything special in order to exit.  When I run my application, however, the
 spark-submit script just hangs there at the end.  Is there something
 special I need to do to get that thing to exit normally?



empty jdbc RDD in spark

2015-05-02 Thread Hafiz Mujadid
Hi all!
I am trying to read a hana database using spark jdbc RDD;
here is my code:
def readFromHana() {
  val conf = new SparkConf()
  conf.setAppName("test").setMaster("local")
  val sc = new SparkContext(conf)
  val rdd = new JdbcRDD(sc, () => {
      Class.forName("com.sap.db.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2",
        "mujadid", "786Xyz123")
    },
    "SELECT *  FROM MEMBERS LIMIT ? OFFSET  ?",
    0, 100, 1,
    (r: ResultSet) => convert(r))
  println(rdd.count());
  sc.stop()
}
def convert(rs: ResultSet): String = {
  val rsmd = rs.getMetaData()
  val numberOfColumns = rsmd.getColumnCount()
  var i = 1
  val row = new StringBuilder
  while (i <= numberOfColumns) {
    row.append(rs.getString(i) + ",")
    i += 1
  }
  row.toString()
}

The resultant count is 0

Any suggestion?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/empty-jdbc-RDD-in-spark-tp22736.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Drop a column from the DataFrame.

2015-05-02 Thread dsgriffin
Just use select() to create a new DataFrame with only the columns you want.
Sort of the opposite of what you want -- you select all the columns
minus the one you don't want. You could even use a filter to
remove just the one column you don't want on the fly:

myDF.select(myDF.columns.filter(_ != "column_you_do_not_want").map(colname
=> new Column(colname)).toList : _* )



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Drop-a-column-from-the-DataFrame-tp22711p22737.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.logConf with log4j.rootCategory=WARN

2015-05-02 Thread Akhil Das
It could be.

Thanks
Best Regards

On Fri, May 1, 2015 at 9:11 PM, roy rp...@njit.edu wrote:

 Hi,

   I have recently enabled log4j.rootCategory=WARN, console in the spark
  configuration, but after that spark.logConf=true has become ineffective.

    So I just want to confirm whether this is because of log4j.rootCategory=WARN?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-logConf-with-log4j-rootCategory-WARN-tp22731.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark Streaming Kafka Avro NPE on deserialization of payload

2015-05-02 Thread Akhil Das
There was a similar discussion over here
http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E

Thanks
Best Regards

On Fri, May 1, 2015 at 7:12 PM, Todd Nist tsind...@gmail.com wrote:

 *Resending as I do not see that this made it to the mailing list; sorry if
 in fact it did and is just not reflected online yet.*

 I’m very perplexed with the following. I have a set of AVRO generated
 objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming
 job follows the receiver-based approach. I am encountering the below
 error when I attempt to de serialize the payload:

 15/04/30 17:49:25 INFO MapOutputTrackerMasterActor: Asked to send map output 
 locations for shuffle 9 to sparkExecutor@192.168.1.3:6105115/04/30 17:49:25 
 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 9 is 140 
 bytes15/04/30 17:49:25 ERROR TaskResultGetter: Exception while getting task 
 resultcom.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
 Serialization trace:
 relations (com.opsdatastore.model.ObjectDetails)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.NullPointerException
 at org.apache.avro.generic.GenericData$Array.add(GenericData.java:200)
 at 
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
 at 
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 ... 17 more15/04/30 17:49:25 INFO TaskSchedulerImpl: Removed TaskSet 20.0, 
 whose tasks have all completed, from pool

 Basic code looks like this.

 Register the class with Kryo as follows:

 val sc = new SparkConf(true)
   .set("spark.streaming.unpersist", "true")
   .setAppName("StreamingKafkaConsumer")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

 // register all related AVRO generated classes
 sc.registerKryoClasses(Array(
 classOf[ConfigurationProperty],
 classOf[Event],
 classOf[Identifier],
 classOf[Metric],
 classOf[ObjectDetails],
 classOf[Relation],
 classOf[RelationProperty]
 ))

 Use the receiver based approach to consume messages from Kafka:

  val messages = KafkaUtils.createStream[Array[Byte], Array[Byte], 
 DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics, storageLevel)

 Now process the received messages:

 val raw = messages.map(_._2)
 val dStream = raw.map(
   byte => {
 // Avro Decoder
 println("Byte length: " + byte.length)
 val decoder = new AvroDecoder[ObjectDetails](schema = 
 ObjectDetails.getClassSchema)
 val message = decoder.fromBytes(byte)
 println(s"AvroMessage : Type : ${message.getType}, Payload : $message")
 message
   }
 )

 When I look in the logs of the workers, in standard out I can see the
 messages being printed; in fact I'm even able to access the Type field
 without issue:

 Byte length: 315
 AvroMessage : Type : Storage, Payload : {name: Storage 1, type: 
 Storage, vendor: 6274g51cbkmkqisk, model: lk95hqk9m10btaot, 
 timestamp: 1430428565141, identifiers: {ID: {name: ID, value: 
 Storage-1}}, configuration: null, metrics: {Disk Space Usage (GB): 
 {name: Disk Space Usage 

Re: Spark worker error on standalone cluster

2015-05-02 Thread Michael Ryabtsev
Thanks Akhil,

I am trying to investigate this path. The Spark version is the same, but maybe
there is a difference in Hadoop.

On Sat, May 2, 2015 at 6:25 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Just make sure you are using the same version of spark in your cluster
 and in the project's build file.

 Thanks
 Best Regards

 On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) 
 mich...@totango.com wrote:

 Hi everyone,

 I have a spark application that works fine on a standalone Spark cluster
 that runs on my laptop
 (master and one worker), but fails when I try to run in on a standalone
 Spark cluster
 deployed on EC2 (master and worker are on different machines).
 The application structure goes in the following way:
 There is a java process ('message processor') that runs on the same
 machine
 as
 Spark master.  When it starts, it submits itself to Spark master, then,
 it listens on SQS and on each received message, it should run a spark job
 to
 process a file from S3, whose address is configured in the message.
 It looks like all this fails at the point where the Spark driver tries to
 send the job
 to the Spark executor.
 Below is the code from the 'message processor' that configures the
 SparkContext,
 Then the Spark driver log, and then the Spark executor log.
 The outputs of my code and some important points are marked in bold and
 I've simplified the code and logs in some places for the sake of
 readability.
 Would appreciate your help very much, because I've run out of ideas with
 this problem.

 'message processor' code:
 ===
 ===
 ||
 logger.info("Started Integration Hub SubmitDriver in test mode.");

 SparkConf sparkConf = new SparkConf()
 .setMaster(SPARK_MASTER_URI)
 .setAppName(APPLICATION_NAME)
 .setSparkHome(SPARK_LOCATION_ON_EC2_MACHINE);

 sparkConf.setJars(JavaSparkContext.jarOfClass(this.getClass()));

 // configure spark executor to use log4j properties located in the local
 // spark conf dir
 sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC "
 + "-Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties");

 sparkConf.set("spark.executor.memory", "1g");
 sparkConf.set("spark.cores.max", "3");
 // Spill shuffle to disk to avoid OutOfMemory, at cost of reduced
 // performance
 sparkConf.set("spark.shuffle.spill", "true");

 logger.info("Connecting Spark");
 JavaSparkContext sc = new JavaSparkContext(sparkConf);

 sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_KEY);
 sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET);

 logger.info("Spark connected");
 ||

 ==

 Driver log:

 ==||
 2015-05-01 07:47:14 INFO  ClassPathBeanDefinitionScanner:239 - JSR-330
 'javax.inject.Named' annotation found and supported for component scanning
 2015-05-01 07:47:14 INFO  AnnotationConfigApplicationContext:510 -
 Refreshing

 org.springframework.context.annotation.AnnotationConfigApplicationContext@5540b23b
 :
 startup date [Fri May 01 07:47:14 UTC 2015]; root of context hierarchy
 2015-05-01 07:47:14 INFO  AutowiredAnnotationBeanPostProcessor:140 -
 JSR-330
 'javax.inject.Inject' annotation found and supported for autowiring
 2015-05-01 07:47:14 INFO  DefaultListableBeanFactory:596 -
 Pre-instantiating
 singletons in

 org.springframework.beans.factory.support.DefaultListableBeanFactory@13f948e
 :
 defining beans

 [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,integrationHubConfig,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,processorInlineDriver,s3Accessor,cdFetchUtil,httpUtil,cdPushUtil,submitDriver,databaseLogger,connectorUtil,totangoDataValidations,environmentConfig,sesUtil,processorExecutor,processorDriver];
 root of factory hierarchy
 *2015-05-01 07:47:15 INFO  SubmitDriver:69 - Started Integration Hub
 SubmitDriver in test mode.
 2015-05-01 07:47:15 INFO  SubmitDriver:101 - Connecting Spark
 *2015-05-01 07:47:15 INFO  SparkContext:59 - Running Spark version 1.3.0
 2015-05-01 07:47:16 WARN  NativeCodeLoader:62 - Unable to load
 native-hadoop
 library for your platform... using builtin-java classes where applicable
 2015-05-01 07:47:16 INFO  SecurityManager:59 - Changing view acls to:
 hadoop
 2015-05-01 07:47:16 INFO  SecurityManager:59 - Changing modify acls to:
 hadoop
 2015-05-01 07:47:16 INFO  SecurityManager:59 - SecurityManager:
 authentication disabled; ui acls 

spark filestream problem

2015-05-02 Thread Evo Eftimov
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them but that leads to a well known problem with .tmp files which
get removed and make the spark streaming filestream throw an exception



sparkR equivalent to SparkContext.newAPIHadoopRDD?

2015-05-02 Thread David Holiday
Hi gang,

I'm giving sparkR a test drive and am bummed to discover that the SparkContext 
API in sparkR is only a subset of what's available in stock spark. 
Specifically, I need to be able to pull data from accumulo into sparkR. I can 
do it with stock spark but can't figure out how to make the magic happen with 
sparkR. Anyone got any ideas?

thanks!


DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com


www.AnnaiSystems.com



Problem in Standalone Mode

2015-05-02 Thread drarse
When I run my program with spark-submit everything is OK. But when I try to
run it in standalone mode I get the following exceptions:

((This is with

val df = sqlContext.jsonFile("./datos.json")

))
java.io.EOFException
[error] at
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744)


This is SparkConf

val sparkConf = new SparkConf().setAppName("myApp")
  .setMaster("spark://master:7077")
  .setSparkHome("/usr/local/spark/")
  .setJars(Seq("./target/scala-2.10/myApp.jar"))




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-in-Standalone-Mode-tp22741.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Help with publishing to Kafka from Spark Streaming?

2015-05-02 Thread Saisai Shao
Here is the pull request, you may refer to this:

https://github.com/apache/spark/pull/2994

Thanks
Jerry


2015-05-01 14:38 GMT+08:00 Pavan Sudheendra pavan0...@gmail.com:

 Link to the question:

 http://stackoverflow.com/questions/29974017/spark-kafka-producer-not-serializable-exception

 Thanks for any pointers.
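
Until something like that is available, a minimal sketch of the usual workaround for the not-serializable producer (create the producer inside foreachPartition so it never has to be serialized; the DStream, topic and broker list below are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // created on the executor, once per partition, never shipped from the driver
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String]("output-topic", r)))
    producer.close()
  }
}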



Re: how to pass configuration properties from driver to executor?

2015-05-02 Thread Akhil Das
In fact, sparkConf.set("spark.whateverPropertyYouWant", "Value") gets shipped
to the executors.
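
For instance, a minimal sketch (the property key below is made up): anything set on the SparkConf can be read back on the driver, and a value captured in a closure is then available inside tasks on the executors as well.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("config-demo")
  .set("spark.myapp.propX", "B")

val sc = new SparkContext(conf)
val propX = sc.getConf.get("spark.myapp.propX")            // read on the driver
val tagged = sc.parallelize(1 to 3).map(i => s"$propX-$i") // propX travels with the closure
tagged.collect().foreach(println)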

Thanks
Best Regards

On Fri, May 1, 2015 at 2:55 PM, Michael Ryabtsev mich...@totango.com
wrote:

 Hi,

 We've had a similar problem, but with log4j properties file.
 The only working way we've found, was externally deploying the properties
 file on the worker machine to the spark conf folder and configuring the
 executor jvm options with:

  sparkConf.set("spark.executor.extraJavaOptions",
  "-Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties");

 It is done in our case from the java code, but I think it can be also a
 parameter to spark-submit.

 Regards,

 Michael.

 On Fri, May 1, 2015 at 2:26 AM, Tian Zhang tzhang...@yahoo.com wrote:

 Hi,

 We have a scenario as below and would like your suggestion.
  We have an app.conf file with propX=A as the default, built into the fat jar file
  that is provided to spark-submit.
  We have an env.conf file with propX=B that we would like spark-submit to take as
  input to overwrite the default and populate to both driver and executors.
 Note in the executor, we are using some package that is using typesafe
 config to read configuration properties.

 How do we do that?

 Thanks.

 Tian



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-to-pass-configuration-properties-from-driver-to-executor-tp22728.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





[PSA] Use Stack Overflow!

2015-05-02 Thread Nick Chammas
This mailing list sees a lot of traffic every day.

With such a volume of mail, you may find it hard to find discussions you
are interested in, and if you are the one starting discussions you may
sometimes feel your mail is going into a black hole.

We can't change the nature of this mailing list (it is required by the
Apache foundation), but there is an alternative in Stack Overflow
http://stackoverflow.com/.

Stack Overflow has an active Apache Spark tag
http://stackoverflow.com/questions/tagged/apache-spark, and Spark
committers like Sean Owen http://stackoverflow.com/users/64174/sean-owen
and Josh Rosen http://stackoverflow.com/users/590203/josh-rosen are
active under that tag, along with several contributors and of course
regular users.

Try it out! I think when your question fits Stack Overflow's guidelines
http://stackoverflow.com/help/on-topic, it is generally better to ask it
there than on this mailing list.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PSA-Use-Stack-Overflow-tp22732.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Generating version agnostic jar path value for --jars clause

2015-05-02 Thread nitinkak001
I have a list of cloudera jars which I need to provide in --jars clause,
mainly for the HiveContext functionality I am using. However,  many of these
jars have version number as part of their names. This leads to an issue that
the names might change when I do a Cloudera upgrade.

Just a note here: there are many jars which Cloudera exposes as a symlink
which links to the latest version of that jar (e.g.
/opt/cloudera/parcels/CDH/lib/parquet/parquet-hadoop-bundle.jar ->
parquet-hadoop-bundle-1.5.0-cdh5.3.2.jar), in which case it's good, but there
are many jars which aren't.

Is there a flexible way to avoid this situation?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Generating-version-agnostic-jar-path-value-for-jars-clause-tp22734.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark worker error on standalone cluster

2015-05-02 Thread Akhil Das
Just make sure you are using the same version of spark in your cluster
and in the project's build file.

Thanks
Best Regards

On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) 
mich...@totango.com wrote:

 Hi everyone,

 I have a spark application that works fine on a standalone Spark cluster
 that runs on my laptop
 (master and one worker), but fails when I try to run in on a standalone
 Spark cluster
 deployed on EC2 (master and worker are on different machines).
 The application structure goes in the following way:
 There is a java process ('message processor') that runs on the same machine
 as
 Spark master.  When it starts, it submits itself to Spark master, then,
 it listens on SQS and on each received message, it should run a spark job
 to
 process a file from S3, whose address is configured in the message.
 It looks like all this fails at the point where the Spark driver tries to
 send the job
 to the Spark executor.
 Below is the code from the 'message processor' that configures the
 SparkContext,
 Then the Spark driver log, and then the Spark executor log.
 The outputs of my code and some important points are marked in bold and
 I've simplified the code and logs in some places for the sake of
 readability.
 Would appreciate your help very much, because I've run out of ideas with
 this problem.

 'message processor' code:
 ===
 ===
 ||
 logger.info("Started Integration Hub SubmitDriver in test mode.");

 SparkConf sparkConf = new SparkConf()
 .setMaster(SPARK_MASTER_URI)
 .setAppName(APPLICATION_NAME)
 .setSparkHome(SPARK_LOCATION_ON_EC2_MACHINE);

 sparkConf.setJars(JavaSparkContext.jarOfClass(this.getClass()));

 // configure spark executor to use log4j properties located in the local
 // spark conf dir
 sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC "
 + "-Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties");

 sparkConf.set("spark.executor.memory", "1g");
 sparkConf.set("spark.cores.max", "3");
 // Spill shuffle to disk to avoid OutOfMemory, at cost of reduced
 // performance
 sparkConf.set("spark.shuffle.spill", "true");

 logger.info("Connecting Spark");
 JavaSparkContext sc = new JavaSparkContext(sparkConf);

 sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_KEY);
 sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET);

 logger.info("Spark connected");
 ||

 ==

 Driver log:

 ==||
 2015-05-01 07:47:14 INFO  ClassPathBeanDefinitionScanner:239 - JSR-330
 'javax.inject.Named' annotation found and supported for component scanning
 2015-05-01 07:47:14 INFO  AnnotationConfigApplicationContext:510 -
 Refreshing

 org.springframework.context.annotation.AnnotationConfigApplicationContext@5540b23b
 :
 startup date [Fri May 01 07:47:14 UTC 2015]; root of context hierarchy
 2015-05-01 07:47:14 INFO  AutowiredAnnotationBeanPostProcessor:140 -
 JSR-330
 'javax.inject.Inject' annotation found and supported for autowiring
 2015-05-01 07:47:14 INFO  DefaultListableBeanFactory:596 -
 Pre-instantiating
 singletons in

 org.springframework.beans.factory.support.DefaultListableBeanFactory@13f948e
 :
 defining beans

 [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,integrationHubConfig,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,processorInlineDriver,s3Accessor,cdFetchUtil,httpUtil,cdPushUtil,submitDriver,databaseLogger,connectorUtil,totangoDataValidations,environmentConfig,sesUtil,processorExecutor,processorDriver];
 root of factory hierarchy
 *2015-05-01 07:47:15 INFO  SubmitDriver:69 - Started Integration Hub
 SubmitDriver in test mode.
 2015-05-01 07:47:15 INFO  SubmitDriver:101 - Connecting Spark
 *2015-05-01 07:47:15 INFO  SparkContext:59 - Running Spark version 1.3.0
 2015-05-01 07:47:16 WARN  NativeCodeLoader:62 - Unable to load
 native-hadoop
 library for your platform... using builtin-java classes where applicable
 2015-05-01 07:47:16 INFO  SecurityManager:59 - Changing view acls to:
 hadoop
 2015-05-01 07:47:16 INFO  SecurityManager:59 - Changing modify acls to:
 hadoop
 2015-05-01 07:47:16 INFO  SecurityManager:59 - SecurityManager:
 authentication disabled; ui acls disabled; users with view permissions:
 Set(hadoop); users with modify permissions: Set(hadoop)
 2015-05-01 07:47:18 INFO  Slf4jLogger:80 - Slf4jLogger started
 2015-05-01 07:47:18 INFO  Remoting:74 - 

spark filestream problem

2015-05-02 Thread Evo Eftimov
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them but that leads to a well known problem with .tmp files which
get removed and make the spark streaming filestream throw an exception



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem-tp22742.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Enabling Event Log

2015-05-02 Thread Jeetendra Gangele
is it working now?

On 1 May 2015 at 13:43, James King jakwebin...@gmail.com wrote:

 Oops! well spotted. Many thanks Shixiong.

 On Fri, May 1, 2015 at 1:25 AM, Shixiong Zhu zsxw...@gmail.com wrote:

 spark.history.fs.logDirectory is for the history server. For Spark
 applications, they should use spark.eventLog.dir. Since you commented out
 spark.eventLog.dir, it will be /tmp/spark-events. And this folder does
  not exist.
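
  For example, the application side can set it explicitly (a sketch in code; the directory is the one from the config quoted below, and the equivalent spark-defaults.conf entry is spark.eventLog.dir):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    // the running application reads spark.eventLog.dir;
    // spark.history.fs.logDirectory is only read by the history server
    .set("spark.eventLog.dir", "file:/opt/cb/tmp/spark-events")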

 Best Regards,
 Shixiong Zhu

 2015-04-29 23:22 GMT-07:00 James King jakwebin...@gmail.com:

 I'm unclear why I'm getting this exception.

 It seems to have realized that I want to enable  Event Logging but
 ignoring where I want it to log to i.e. file:/opt/cb/tmp/spark-events which
 does exist.

 spark-default.conf

 # Example:
 spark.master spark://master1:7077,master2:7077
 spark.eventLog.enabled   true
 spark.history.fs.logDirectoryfile:/opt/cb/tmp/spark-events
 # spark.eventLog.dir   hdfs://namenode:8021/directory
 # spark.serializer
 org.apache.spark.serializer.KryoSerializer
 # spark.driver.memory  5g
 # spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
 -Dnumbers=one two three

 Exception following job submission:

 spark.eventLog.enabled=true
 spark.history.fs.logDirectory=file:/opt/cb/tmp/spark-events

 spark.jars=file:/opt/cb/scripts/spark-streamer/cb-spark-streamer-1.0-SNAPSHOT.jar
 spark.master=spark://master1:7077,master2:7077
 Exception in thread main java.lang.IllegalArgumentException: Log
 directory /tmp/spark-events does not exist.
 at
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99)
 at org.apache.spark.SparkContext.init(SparkContext.scala:399)
 at
 org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642)
 at
 org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:75)
 at
 org.apache.spark.streaming.api.java.JavaStreamingContext.init(JavaStreamingContext.scala:132)


 Many Thanks
 jk






to split an RDD to multiple ones?

2015-05-02 Thread Yifan LI
Hi,

I have an RDD srdd containing (unordered-)data like this:
s1_0, s3_0, s2_1, s2_2, s3_1, s1_3, s1_2, …

What I want is (it will be much better if they could be in ascending order):
srdd_s1:
s1_0, s1_1, s1_2, …, s1_n
srdd_s2:
s2_0, s2_1, s2_2, …, s2_n
srdd_s3:
s3_0, s3_1, s3_2, …, s3_n
…
…

Have any idea? Thanks in advance! :)


Best,
Yifan LI
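
For reference, a minimal sketch of one way to do this with plain RDD operations, assuming the elements are strings like "s1_0" and the prefix before '_' identifies the group (each per-prefix filter costs a pass over the data, so this is only reasonable for a small number of groups):

import org.apache.spark.rdd.RDD

val keyed: RDD[(String, String)] = srdd.map(e => (e.split("_")(0), e))   // ("s1", "s1_0"), ...
val prefixes = keyed.map(_._1).distinct().collect()                      // e.g. Array("s1", "s2", "s3")

val perPrefix: Map[String, RDD[String]] = prefixes.map { p =>
  // one RDD per prefix, in ascending order of the numeric suffix
  p -> keyed.filter(_._1 == p).map(_._2).sortBy(_.split("_")(1).toInt)
}.toMap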







Re: DataFrame filter referencing error

2015-05-02 Thread Francesco Bigarella
First of all, thank you for your replies.

I was previously doing this via normal jdbc connection and it worked
without problems. Then I liked the idea that sparksql could take care of
opening/closing the connection.

I tried also with single quotes, since that was my first guess but didn't
work. I fear I will have to look at the spark code but I'm the only one
with this issue.

BTW I'm testing with spark 1.3.0

Best,
Francesco

On Fri, May 1, 2015, 00:54 ayan guha guha.a...@gmail.com wrote:

 I think you need to specify new in single quotes. My guess is the query
 showing up in the DB is like
 ...where status=new or
 ...where status="new"
 Either case mysql assumes new is a column.
 What you need is the form below
 ...where status='new'

 You need to provide your quotes accordingly.

 The easiest way would be to do it in a separate jdbc conn to mysql using
 a simple standalone programme, not in spark.
 On 1 May 2015 07:47, Burak Yavuz brk...@gmail.com wrote:

 Is new a reserved word for MySQL?

 On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella 
 francesco.bigare...@gmail.com wrote:

 Do you know how I can check that? I googled a bit but couldn't find a
 clear explanation about it. I also tried to use explain() but it doesn't
 really help.
 I still find unusual that I have this issue only for the equality
 operator but not for the others.

 Thank you,
 F

 On Wed, Apr 29, 2015 at 3:03 PM ayan guha guha.a...@gmail.com wrote:

 Looks like your DF is based on a MySQL DB using jdbc, and the error is
 thrown from MySQL. Can you see what SQL is finally getting fired in MySQL?
 Spark is pushing down the predicate to mysql so it's not a spark problem
 per se.

 On Wed, Apr 29, 2015 at 9:56 PM, Francesco Bigarella 
 francesco.bigare...@gmail.com wrote:

 Hi all,

 I was testing the DataFrame filter functionality and I found what I
 think is a strange behaviour.
 My dataframe testDF, obtained by loading a MySQL table via jdbc, has the
 following schema:
 root
  | -- id: long (nullable = false)
  | -- title: string (nullable = true)
  | -- value: string (nullable = false)
  | -- status: string (nullable = false)

 What I want to do is filter my dataset to obtain all rows that have a
 status = "new".

 scala> testDF.filter(testDF("id") === 1234).first()
 works fine (also with the integer value within double quotes), however
 if I try to use the same statement to filter on the status column (also
 with changes in the syntax - see below), suddenly the program breaks.

 Any of the following
 scala> testDF.filter(testDF("status") === "new")
 scala> testDF.filter("status = 'new'")
 scala> testDF.filter($"status" === "new")

 generates the error:

 INFO scheduler.DAGScheduler: Job 3 failed: runJob at
 SparkPlan.scala:121, took 0.277907 s

 org.apache.spark.SparkException: Job aborted due to stage failure:
 Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in
 stage 3.0 (TID 12, node name):
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column
 'new' in 'where clause'

 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
 at com.mysql.jdbc.Util.getInstance(Util.java:386)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3597)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3529)
 at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990)
 at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151)
 at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2625)
 at
 com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2119)
 at
 com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2283)
 at org.apache.spark.sql.jdbc.JDBCRDD$anon$1.init(JDBCRDD.scala:328)
 at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 

spark filestream problem

2015-05-02 Thread Evo Eftimov
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them but that leads to a well known problem with .tmp files which
get removed and make the spark streaming filestream throw an exception



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem-tp22743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: empty jdbc RDD in spark

2015-05-02 Thread Ted Yu
bq.   SELECT *  FROM MEMBERS LIMIT ? OFFSET  ?,

Have you tried dropping the LIMIT and OFFSET clauses from the above query?
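
For reference, a minimal sketch of the usual JdbcRDD pattern: the two '?' placeholders are bound to each partition's lower and upper bound, so they normally belong in a WHERE clause on a numeric key rather than in LIMIT/OFFSET (the column name ID below is an assumption; sc and convert are from the quoted code):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(sc,
  () => {
    Class.forName("com.sap.db.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2",
      "mujadid", "786Xyz123")
  },
  "SELECT * FROM MEMBERS WHERE ID >= ? AND ID <= ?",   // bounds filled in per partition
  0, 100, 1,
  (r: ResultSet) => convert(r))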

Cheers

On Fri, May 1, 2015 at 1:56 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:

 Hi all!
 I am trying to read hana database using spark jdbc RDD
 here is my code
 def readFromHana() {
   val conf = new SparkConf()
   conf.setAppName("test").setMaster("local")
   val sc = new SparkContext(conf)
   val rdd = new JdbcRDD(sc, () => {
     Class.forName("com.sap.db.jdbc.Driver").newInstance()
     DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2",
       "mujadid", "786Xyz123")
   },
     "SELECT *  FROM MEMBERS LIMIT ? OFFSET  ?",
     0, 100, 1,
     (r: ResultSet) => convert(r))
   println(rdd.count());
   sc.stop()
 }
 def convert(rs: ResultSet): String = {
   val rsmd = rs.getMetaData()
   val numberOfColumns = rsmd.getColumnCount()
   var i = 1
   val row = new StringBuilder
   while (i <= numberOfColumns) {
     row.append(rs.getString(i) + ",")
     i += 1
   }
   row.toString()
 }

 The resultant count is 0

 Any suggestion?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/empty-jdbc-RDD-in-spark-tp22736.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Number of input partitions in SparkContext.sequenceFile

2015-05-02 Thread Archit Thakur
Hi,

How did you check the number of splits in your file? Did you run your MR job or
calculate it?
 The formula for split size is
max(minSize, min(maxSize, blockSize)). Can you check whether it satisfies your
case?
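
For a quick check, the same formula as a small sketch (the 128 MB block size and the Hadoop defaults for min/max below are just example values):

// Hadoop's FileInputFormat split size, i.e. max(minSize, min(maxSize, blockSize))
def splitSize(minSize: Long, maxSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(maxSize, blockSize))

// e.g. default minSize = 1, maxSize = Long.MaxValue, 128 MB blocks:
splitSize(1L, Long.MaxValue, 128L * 1024 * 1024)   // = 134217728, one split per block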

Thanks & Regards,
Archit Thakur.

On Saturday, April 25, 2015, Wenlei Xie wenlei@gmail.com wrote:

 Hi,

 I checked the number of partitions by

  System.out.println("INFO: RDD with " + rdd.partitions().size() + "
  partitions created.");


 Each single split is about 100MB. I am currently loading the data from
  the local file system; would this explain this observation?

 Thank you!

 Best,
 Wenlei

 On Tue, Apr 21, 2015 at 6:28 AM, Archit Thakur archit279tha...@gmail.com
  wrote:

 Hi,

  It should generate the same number of partitions as the number of splits.
  How did you check the number of partitions? Also please paste your file size and
  hdfs-site.xml and mapred-site.xml here.

 Thanks and Regards,
 Archit Thakur.

 On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie wenlei@gmail.com
  wrote:

 Hi,

  I am wondering about the mechanism that determines the number of partitions
  created by SparkContext.sequenceFile?

 For example, although my file has only 4 splits, Spark would create 16
 partitions for it. Is it determined by the file size? Is there any way to
 control it? (Looks like I can only tune minPartitions but not maxPartitions)

 Thank you!

 Best,
 Wenlei






 --
 Wenlei Xie (谢文磊)

 Ph.D. Candidate
 Department of Computer Science
 456 Gates Hall, Cornell University
 Ithaca, NY 14853, USA
 Email: wenlei@gmail.com



Re: How to add a column to a spark RDD with many columns?

2015-05-02 Thread Carter
Thanks for your reply! It is what I am after.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: not getting any mail

2015-05-02 Thread Ted Yu
Looks like there were delays across Apache project mailing lists. 

Emails are coming through now. 



 On May 2, 2015, at 9:14 AM, Jeetendra Gangele gangele...@gmail.com wrote:
 
 Hi All
 
 I am not getting any mail from this community?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark filestream problem

2015-05-02 Thread Evo Eftimov
I have figured it out in the meantime - simply, when moving a file on HDFS it
preserves its time stamp, and on the other hand the spark filestream adapter
seems to care less about filenames than about timestamps - hence NEW files with
OLD time stamps will NOT be processed - yuk

The hack you can use is to a) copy the required file to a temp location and
then b) move it from there to the dir monitored by spark filestream - this
will ensure it has a recent timestamp
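
A sketch of that workaround with the Hadoop FileSystem API (the paths are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val fs = FileSystem.get(new Configuration())
val src     = new Path("/data/incoming/events.csv")
val staging = new Path("/data/staging/events.csv")
val watched = new Path("/data/watched/events.csv")   // the dir monitored by fileStream

// a) copy: the staged copy gets a fresh modification time
FileUtil.copy(fs, src, fs, staging, false, fs.getConf)
// b) move into the monitored dir: rename is cheap and keeps the new timestamp
fs.rename(staging, watched)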

-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Saturday, May 2, 2015 5:09 PM
To: user@spark.apache.org
Subject: spark filestream problem

it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them but that leads to a well known problem with .tmp files which
get removed and make the spark streaming filestream throw an exception



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem
-tp22743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark filestream problem

2015-05-02 Thread Evo Eftimov
I have figured it out in the meantime - simply, when moving a file on HDFS it
preserves its time stamp, and on the other hand the spark filestream adapter
seems to care less about filenames than about timestamps - hence NEW files with
OLD time stamps will NOT be processed - yuk

The hack you can use is to a) copy the required file to a temp location and
then b) move it from there to the dir monitored by spark filestream - this
will ensure it has a recent timestamp

-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Saturday, May 2, 2015 5:07 PM
To: user@spark.apache.org
Subject: spark filestream problem

it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them but that leads to a well known problem with .tmp files which
get removed and make the spark streaming filestream throw an exception



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem
-tp22742.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Submit Kill Spark Application program programmatically from another application

2015-05-02 Thread Yijie Shen
Hi,

I am wondering if it is possible to submit, monitor & kill spark applications
from another service.

I have written a service like this:

parse user commands
translate them into understandable arguments to an already prepared Spark-SQL 
application
submit the application along with arguments to Spark Cluster using spark-submit 
from ProcessBuilder
run generated applications' driver in cluster mode.
The above 4 steps have been finished, but I have difficulties with these two:

Query the application's status, for example the percentage completion.
Kill queries accordingly.
What I find in the spark standalone documentation suggests killing an application using:

./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
And I should find the driver ID through the standalone Master web UI at
http://<master url>:8080.

Are there any programmatic methods by which I could get the driver ID submitted by my
`ProcessBuilder` and query the status of that application?

Any Suggestions?

— 
Best Regards!
Yijie Shen

Re: Can I group elements in RDD into different groups and let each group share some elements?

2015-05-02 Thread Olivier Girardot
Did you look at the cogroup transformation or the cartesian transformation ?

Regards,

Olivier.
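
For example, a minimal sketch of expressing the grouping as a single join instead of 2000 filter passes (the element and mapping RDDs below are made up):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val elements: RDD[(String, String)] = sc.parallelize(Seq(("e1", "e1"), ("e3", "e3"), ("e554", "e554")))
val mapping:  RDD[(String, Int)]    = sc.parallelize(Seq(("e1", 1), ("e1", 2), ("e554", 2), ("e3", 3)))

// one shuffle instead of 2000 filter + union passes
val grouped: RDD[(Int, Iterable[String])] =
  elements.join(mapping)                               // (elemId, (elem, groupId))
          .map { case (_, (elem, gid)) => (gid, elem) }
          .groupByKey()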

On Sat, May 2, 2015 at 22:01, Franz Chien franzj...@gmail.com wrote:

 Hi all,

 Can I group elements in RDD into different groups and let each group share
 elements? For example, I have 10,000 elements in RDD from e1 to e1, and
 I want to group and aggregate them by another mapping with size of 2000,
 ex: ( (e1,e42), (e1,e554), (e3, e554)…… (2000th group))

 My first approach was to filter the RDD with mapping rules 2000 times,
 and then union them together. However, it ran forever. Does Spark provide a
 way to group elements in an RDD like this?


 Thanks,


 Franz



Re: real time Query engine Spark-SQL on Hbase

2015-05-02 Thread Ted Yu
In the upcoming 1.4.0 release, SPARK-3468 should give you better clue.

Cheers

On Fri, May 1, 2015 at 12:30 PM, Siddharth Ubale 
siddharth.ub...@syncoms.com wrote:

  Hi,


  Thanks for the reply.


  Hbase cli takes less than 500 ms for the same query.

 I am running a simple query, i.e. Select * from Customers where
 c_id='123123'.

 Why would the same query which takes 500 ms at Hbase cli end up taking
 around 8 secs via Spark-Sql?

 I am unable to understand this.


  Thanks,

 Siddharth





  --
 *From:* ayan guha guha.a...@gmail.com
 *Sent:* 01 May 2015 04:38
 *To:* Ted Yu
 *Cc:* user@spark.apache.org; Siddharth Ubale; matei.zaha...@gmail.com;
 Prakash Hosalli; Amit Kumar
 *Subject:* Re: real time Query engine Spark-SQL on Hbase


 And if I may ask, how long does it take in the hbase CLI? I would not expect spark
 to improve the performance of hbase. At best spark will push down the filter
 to hbase. So I would try to optimise any additional overhead like bringing
 data into spark.
 On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote:

 bq. a single query on one filter criteria

  Can you tell us more about your filter ? How selective is it ?

  Which hbase release are you using ?

  Cheers

 On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale 
 siddharth.ub...@syncoms.com wrote:

  Hi,



 I want to use Spark as Query engine on HBase with sub second latency.



 I am using Spark 1.3. I followed the steps below on an Hbase
 table with around 3.5 lac (350,000) rows:



 1.   Mapped the Dataframe to the Hbase table. RDDCustomers maps to
 the hbase table which is used to create the Dataframe:

 "DataFrame schemaCustomers = sqlInstance
          .createDataFrame(SparkContextImpl.getRddCustomers(),
                           Customers.class);"

 2.   Used registerTempTable, i.e.
 schemaCustomers.registerTempTable("customers");

 3.   Running the query on the Dataframe using the Sqlcontext instance.



 What I am observing is that for a single query on one filter criterion,
 the query is taking 7-8 seconds, and the time increases as I increase
 the number of rows in the HBase table. Also, there was one time when I was
 getting the query response in under 1-2 seconds, which seems like strange behavior.

 Is this expected behavior from Spark or am I missing something here?

 Can somebody help me understand this scenario? Please assist.



 Thanks,

 Siddharth Ubale,







Re: ClassNotFoundException for Kryo serialization

2015-05-02 Thread Akshat Aranya
Now I am running up against some other problem while trying to schedule tasks:

15/05/01 22:32:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalStateException: unread block data
at 
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2419)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)


I verified that the same configuration works without using Kryo serialization.
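For readers following the thread, the registration being configured is roughly of this shape (a sketch; MyRow stands in for the protobuf-generated com.example.Schema$MyRow, and the registered classes must of course also be on the executors' classpath, which is what the original error was about):

import org.apache.spark.{SparkConf, SparkContext}

case class MyRow(id: Long, payload: Array[Byte])   // placeholder for the real class

val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRow]))      // helper available since Spark 1.2

val sc = new SparkContext(conf)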


On Fri, May 1, 2015 at 9:44 AM, Akshat Aranya aara...@gmail.com wrote:
 I cherry-picked the fix for SPARK-5470 and the problem has gone away.

 On Fri, May 1, 2015 at 9:15 AM, Akshat Aranya aara...@gmail.com wrote:
 Yes, this class is present in the jar that was loaded in the classpath
 of the executor Java process -- it wasn't even lazily added as a part
 of the task execution.  Schema$MyRow is a protobuf-generated class.

 After doing some digging around, I think I might be hitting up against
 SPARK-5470, the fix for which hasn't been merged into 1.2, as far as I
 can tell.

 On Fri, May 1, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote:
 bq. Caused by: java.lang.ClassNotFoundException: com.example.Schema$MyRow

 So the above class is in the jar which was in the classpath ?
 Can you tell us a bit more about Schema$MyRow ?

 On Fri, May 1, 2015 at 8:05 AM, Akshat Aranya aara...@gmail.com wrote:

 Hi,

 I'm getting a ClassNotFoundException at the executor when trying to
 register a class for Kryo serialization:

 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
   at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:243)
   at
 org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:254)
   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:257)
   at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:182)
   at org.apache.spark.executor.Executor.<init>(Executor.scala:87)
   at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:61)
   at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
   at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
   at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
   at
 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
   at
 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
   at
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
   at
 org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at
 org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:36)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Failed to load class to
 register with Kryo
   at
 

Re: to split an RDD to multiple ones?

2015-05-02 Thread Olivier Girardot
I guess :

val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(identity)
val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(identity)
val srdd_s3 = srdd.filter(_.startsWith("s3_")).sortBy(identity)
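If the set of prefixes is larger, the same idea generalizes (a sketch):

val prefixes = Seq("s1_", "s2_", "s3_")
val byPrefix = prefixes.map(p => p -> srdd.filter(_.startsWith(p)).sortBy(identity)).toMap
// byPrefix("s1_") is the sorted RDD of the s1_* elements
// note: this still scans srdd once per prefix; for very many prefixes,
// keying each element by its prefix and using groupByKey avoids the rescans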

Regards,

Olivier.

Le sam. 2 mai 2015 à 22:53, Yifan LI iamyifa...@gmail.com a écrit :

 Hi,

 I have an RDD *srdd* containing (unordered-)data like this:
 s1_0, s3_0, s2_1, s2_2, s3_1, s1_3, s1_2, …

 What I want is (it will be much better if they could be in ascending
 order):
 *srdd_s1*:
 s1_0, s1_1, s1_2, …, s1_n
 *srdd_s2*:
 s2_0, s2_1, s2_2, …, s2_n
 *srdd_s3*:
 s3_0, s3_1, s3_2, …, s3_n
 …
 …

 Have any idea? Thanks in advance! :)


 Best,
 Yifan LI








Re: Drop a column from the DataFrame.

2015-05-02 Thread Olivier Girardot
Sounds like a patch for a drop method...

Le sam. 2 mai 2015 à 21:03, dsgriffin dsgrif...@gmail.com a écrit :

 Just use select() to create a new DataFrame with only the columns you want.
 Sort of the opposite of what you asked for -- you select all of the columns
 except the one you don't want. You could even filter out that one column on
 the fly:

 myDF.select(myDF.columns.filter(_ != "column_you_do_not_want")
   .map(colname => new Column(colname)).toList : _*)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Drop-a-column-from-the-DataFrame-tp22711p22737.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-02 Thread Olivier Girardot
Can you post your code, otherwise there's not much we can do.

Regards,

Olivier.

Le sam. 2 mai 2015 à 21:15, shahab shahab.mok...@gmail.com a écrit :

 Hi,

  I am using Spark 1.2.0 and I used Kryo serialization, but I get the
  following exception.

 java.io.IOException: com.esotericsoftware.kryo.KryoException:
 java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1

  I would appreciate it if anyone could tell me how I can resolve this.

 best,
 /Shahab



Re: Drop a column from the DataFrame.

2015-05-02 Thread Ted Yu
This is coming in 1.4.0
https://issues.apache.org/jira/browse/SPARK-7280
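For reference, per that JIRA the usage in 1.4 is expected to be a one-liner (a sketch; not available in 1.3):

// once SPARK-7280 lands in Spark 1.4.0:
val trimmed = myDF.drop("column_you_do_not_want")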



 On May 2, 2015, at 2:27 PM, Olivier Girardot ssab...@gmail.com wrote:
 
 Sounds like a patch for a drop method...
 
 Le sam. 2 mai 2015 à 21:03, dsgriffin dsgrif...@gmail.com a écrit :
 Just use select() to create a new DataFrame with only the columns you want.
 Sort of the opposite of what you asked for -- you select all of the columns
 except the one you don't want. You could even filter out that one column on
 the fly:
 
 myDF.select(myDF.columns.filter(_ != "column_you_do_not_want")
   .map(colname => new Column(colname)).toList : _*)
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Drop-a-column-from-the-DataFrame-tp22711p22737.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


Re: to split an RDD to multiple ones?

2015-05-02 Thread Yifan LI
Thanks, Olivier and Franz. :)


Best,
Yifan LI





 On 02 May 2015, at 23:23, Olivier Girardot ssab...@gmail.com wrote:
 
 I guess : 
 
 val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(identity)
 val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(identity)
 val srdd_s3 = srdd.filter(_.startsWith("s3_")).sortBy(identity)
 
 Regards, 
 
 Olivier.
 
 Le sam. 2 mai 2015 à 22:53, Yifan LI iamyifa...@gmail.com 
 mailto:iamyifa...@gmail.com a écrit :
 Hi,
 
 I have an RDD srdd containing (unordered-)data like this:
 s1_0, s3_0, s2_1, s2_2, s3_1, s1_3, s1_2, …
 
 What I want is (it will be much better if they could be in ascending order):
 srdd_s1:
 s1_0, s1_1, s1_2, …, s1_n
 srdd_s2:
 s2_0, s2_1, s2_2, …, s2_n
 srdd_s3:
 s3_0, s3_1, s3_2, …, s3_n
 …
 …
 
 Have any idea? Thanks in advance! :)
 
 
 Best,
 Yifan LI
 
 
 
 
 



Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-02 Thread Akhil Das
You could try repartitioning your listings RDD. Also, doing a collectAsMap
basically brings all your data to the driver; in that case you might want
to set the storage level to MEMORY_AND_DISK, though I am not sure that will
help much on the driver side.

Thanks
Best Regards

On Thu, Apr 30, 2015 at 11:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Full Exception
 15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at
 VISummaryDataProvider.scala:37) failed in 884.087 s
 15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap
 at VISummaryDataProvider.scala:37, took 1093.418249 s
 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]
 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage
 failure: Exception while getting task result:
 org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)])


 Code at line 37:

 val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }
   .collectAsMap

 The listing data set size is 26G (10 files) and my driver memory is 12G (I
 can't go beyond it). The reason I do collectAsMap is to broadcast it and do a
 map-side join instead of a regular join.
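 For reference, the map-side (broadcast) join being described looks roughly like the sketch below (spark-shell style, with a SparkContext `sc`; `Listing`, `Event` and the sample data are placeholders). It is only viable when the collected map comfortably fits in driver and executor memory, which ~26G of listings against a 12G driver does not, so a regular join, possibly with more partitions, may be unavoidable here.

 import org.apache.spark.SparkContext._   // pair-RDD ops (needed on older releases)

 case class Listing(itemId: Long, title: String)   // stand-ins for the real types
 case class Event(itemId: Long, kind: String)

 val listings = sc.parallelize(Seq(Listing(1L, "a"), Listing(2L, "b")))
 val events   = sc.parallelize(Seq(Event(1L, "view"), Event(3L, "bid")))

 // only safe when the collected map fits in memory on the driver and executors
 val lstgItemMap = listings.map(l => (l.itemId, l)).collectAsMap()
 val bcItems     = sc.broadcast(lstgItemMap)

 // map-side join: no shuffle of listings, each task looks up the broadcast map
 val joined = events.flatMap(e => bcItems.value.get(e.itemId).map(l => (e, l)))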


 Please suggest ?


 On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 My Spark job is failing and I see

 ==

 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]

 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)


 java.util.concurrent.TimeoutException: Futures timed out after [30
 seconds]


 I see multiple of these

 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]

 And finally i see this
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at
 org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:95)
 at
 

Can I group elements in RDD into different groups and let each group share some elements?

2015-05-02 Thread Franz Chien
Hi all,

Can I group elements in RDD into different groups and let each group share
elements? For example, I have 10,000 elements in an RDD, from e1 to e10000, and
I want to group and aggregate them by another mapping with size of 2000,
ex: ( (e1,e42), (e1,e554), (e3, e554)…… (2000th group))

My first approach was to filter the RDD with the mapping rules 2000 times,
and then union them together. However, it ran forever. Does SPARK provide a
way to group elements in RDD like this please?


Thanks,


Franz


Re: directory loader in windows

2015-05-02 Thread ayan guha
Thanks for the answer. I am now trying to set HADOOP_HOME but the issue still
persists. Also, I can see only windows-utils.exe in my HADOOP_HOME, but no
WINUTILS.EXE.

I do not have Hadoop installed on my system, as I am not using HDFS, but I
am using Spark 1.3.1 prebuilt for Hadoop 2.6. Am I missing something?

Best
Ayan

On Tue, Apr 28, 2015 at 12:45 AM, Steve Loughran ste...@hortonworks.com
wrote:


  This a hadoop-side stack trace

  it looks like the code is trying to get the filesystem permissions by
 running

  %HADOOP_HOME%\bin\WINUTILS.EXE  ls -F


  and something is triggering a null pointer exception.

  There isn't any HADOOP- JIRA with this specific stack trace in it, so
 it's not a known/fixed problem.

  At a guess, your HADOOP_HOME environment variable isn't pointing
 to the right place. If that's the case, there should have been a
 warning in the logs.
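
 For what it's worth, one workaround that often comes up for this on Windows (a sketch, assuming winutils.exe has been copied to C:\hadoop\bin) is to point hadoop.home.dir at that directory before anything touches the Hadoop filesystem code; for PySpark the equivalent is setting the HADOOP_HOME environment variable before launching.

 import org.apache.spark.{SparkConf, SparkContext}

 // must be set before the first Hadoop/Spark filesystem call
 System.setProperty("hadoop.home.dir", "C:\\hadoop")   // assumes C:\hadoop\bin\winutils.exe exists

 val sc    = new SparkContext(new SparkConf().setAppName("winutils-sketch").setMaster("local[*]"))
 val lines = sc.textFile("C:\\data\\input-dir")         // placeholder directory
 println(lines.count())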




 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : java.lang.NullPointerException

  at java.lang.ProcessBuilder.start(Unknown Source)

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)

  at org.apache.hadoop.util.Shell.run(Shell.java:455)

  at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

  at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)

  at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)

  at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)

  at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)

  at
 org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557)

  at
 org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:42)

  at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699)

  at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)

  at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)

  at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

  at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

  at scala.Option.getOrElse(Option.scala:120)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

  at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

  at scala.Option.getOrElse(Option.scala:120)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

  at
 org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

  at scala.Option.getOrElse(Option.scala:120)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

  at org.apache.spark.rdd.RDD.collect(RDD.scala:813)

  at
 org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374)

  at
 org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)

  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

  at java.lang.reflect.Method.invoke(Unknown Source)

  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

  at py4j.Gateway.invoke(Gateway.java:259)

  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

  at py4j.commands.CallCommand.execute(CallCommand.java:79)

  at py4j.GatewayConnection.run(GatewayConnection.java:207)

  at java.lang.Thread.run(Unknown Source)

  --
 Best Regards,
 Ayan Guha









-- 
Best Regards,
Ayan Guha