Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-06 Thread Ashish Dutt
Hi Aleksandar,
Quite some time ago I faced the same problem and found a solution, which I
have posted on my blog. See if that helps you; if it does not, you can check
out these questions and solutions on the Stack Overflow website.
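
The links above were stripped by the archive; for what it's worth, one common
workaround (an assumption on my part, not necessarily what the blog post
describes) is to ship the PySpark libraries to the YARN containers explicitly,
for example:

  # paths and the py4j version are assumptions; adjust them to your installation
  spark-submit --master yarn-client \
    --py-files $SPARK_HOME/python/lib/pyspark.zip,$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip \
    PysparkPandas.py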


Sincerely,
Ashish Dutt


On Mon, Sep 7, 2015 at 7:17 AM, Sasha Kacanski  wrote:

> Hi,
> I am successfully running python app via pyCharm in local mode
> setMaster("local[*]")
>
> When I turn on SparkConf().setMaster("yarn-client")
>
> and run via
>
> spark-submit PysparkPandas.py
>
>
> I run into this issue:
> Error from python worker:
>   /cube/PY/Python27/bin/python: No module named pyspark
> PYTHONPATH was:
>
> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>
> I am running java
> hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>
> Should I try the same thing with Java 6/7?
>
> Is this a packaging issue, or do I have something wrong with my configuration?
>
> Regards,
>
> --
> Aleksandar Kacanski
>


Re: [streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Gerard Maas
You need to take into consideration 'where' things are executing. The
closure of the 'forEachRDD'  executes in the driver. Therefore, the log
statements printed during the execution of that part will be found in the
driver logs.
In contrast, the foreachPartition closure executes on the worker nodes. You
will find the '+++ForEachPartition+++' messages printed in the executor log.

So both statements execute, but in different locations of the distributed
computing environment (aka the cluster).
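
For illustration, here is a minimal sketch (mine, not code from the original
thread) that makes the two locations explicit and uses a plain slf4j logger
instead of the DeveloperApi Logging trait; it assumes slf4j is on the
classpath, as it is in a standard Spark deployment:

  source.foreachRDD { rdd =>
    logInfo("+++ForEachRDD+++")  // this closure runs on the driver -> driver log
    rdd.foreachPartition { partitionOfRecords =>
      // this closure runs on an executor -> that executor's log
      val log = org.slf4j.LoggerFactory.getLogger("MyJob")
      log.info("+++ForEachPartition+++")
    }
  }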

-kr, Gerard.

On Sun, Sep 6, 2015 at 10:53 PM, Alexey Ponkin  wrote:

> Hi,
>
> I have the following code
>
> object MyJob extends org.apache.spark.Logging{
> ...
>  val source: DStream[SomeType] ...
>
>  source.foreachRDD { rdd =>
>   logInfo(s"""+++ForEachRDD+++""")
>   rdd.foreachPartition { partitionOfRecords =>
> logInfo(s"""+++ForEachPartition+++""")
>   }
>   }
>
> I was expecting to see both log messages in job log.
> But unfortunately you will never see string '+++ForEachPartition+++' in
> logs, cause block foreachPartition will never execute.
> And also there is no error message or something in logs.
> I wonder is this a bug or known behavior?
> I know that org.apache.spark.Logging is DeveloperAPI, but why it is
> silently fails with no messages?
> What to use instead of org.apache.spark.Logging? in spark-streaming jobs?
>
> P.S. running spark 1.4.1 (on yarn)
>
> Thanks in advance
>


Re: [streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Понькин Алексей
OK, I got it.
When I use the 'yarn logs -applicationId ' command, everything appears in the
right place.
Thank you!

-- 
Yandex.Mail - reliable mail
http://mail.yandex.ru/neo2/collect/?exp=1=1


07.09.2015, 01:44, "Gerard Maas" :
> You need to take into consideration 'where' things are executing. The closure 
> of the 'forEachRDD'  executes in the driver. Therefore, the log statements 
> printed during the execution of that part will be found in the driver logs.
> In contrast, the foreachPartition closure executes on the worker nodes. You 
> will find the '+++ForEachPartition+++' messages printed in the executor log.
>
> So both statements execute, but in different locations of the distributed 
> computing environment (aka cluster)
>
> -kr, Gerard.
>
> On Sun, Sep 6, 2015 at 10:53 PM, Alexey Ponkin  wrote:
>> Hi,
>>
>> I have the following code
>>
>> object MyJob extends org.apache.spark.Logging{
>> ...
>>  val source: DStream[SomeType] ...
>>
>>  source.foreachRDD { rdd =>
>>       logInfo(s"""+++ForEachRDD+++""")
>>       rdd.foreachPartition { partitionOfRecords =>
>>         logInfo(s"""+++ForEachPartition+++""")
>>       }
>>   }
>>
>> I was expecting to see both log messages in job log.
>> But unfortunately you will never see string '+++ForEachPartition+++' in 
>> logs, cause block foreachPartition will never execute.
>> And also there is no error message or something in logs.
>> I wonder is this a bug or known behavior?
>> I know that org.apache.spark.Logging is DeveloperAPI, but why it is silently 
>> fails with no messages?
>> What to use instead of org.apache.spark.Logging? in spark-streaming jobs?
>>
>> P.S. running spark 1.4.1 (on yarn)
>>
>> Thanks in advance
>>



[streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Alexey Ponkin
Hi,

I have the following code

object MyJob extends org.apache.spark.Logging {
  ...
  val source: DStream[SomeType] ...

  source.foreachRDD { rdd =>
    logInfo(s"""+++ForEachRDD+++""")
    rdd.foreachPartition { partitionOfRecords =>
      logInfo(s"""+++ForEachPartition+++""")
    }
  }
}

I was expecting to see both log messages in the job log.
But unfortunately you will never see the string '+++ForEachPartition+++' in the
logs, because the foreachPartition block never executes.
There is also no error message or anything else in the logs.
I wonder: is this a bug or known behavior?
I know that org.apache.spark.Logging is a DeveloperApi, but why does it fail
silently with no messages?
What should I use instead of org.apache.spark.Logging in spark-streaming jobs?

P.S. running spark 1.4.1 (on yarn)

Thanks in advance




hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-06 Thread Sasha Kacanski
Hi,
I am successfully running python app via pyCharm in local mode
setMaster("local[*]")

When I turn on SparkConf().setMaster("yarn-client")

and run via

spark-submit PysparkPandas.py


I run into this issue:
Error from python worker:
  /cube/PY/Python27/bin/python: No module named pyspark
PYTHONPATH was:

/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar

I am running java
hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

Should I try the same thing with Java 6/7?

Is this a packaging issue, or do I have something wrong with my configuration?

Regards,

-- 
Aleksandar Kacanski


Re: Problem to persist Hibernate entity from Spark job

2015-09-06 Thread Zoran Jeremic
I have a GenericDAO class which is initialized for each partition. This class
uses SessionFactory.openSession() to open a new session in its
constructor. As per my understanding, this means that each partition has a
different session, but they all use the same SessionFactory to open it.

why not create the session at the start of the saveInBatch method and close
> it at the end
>
This won't work for me, or at least I think it won't. At the beginning of
the process I load some entities (e.g. User, UserPreference...) from
Hibernate and then use them throughout the process, even after I perform
saveInBatch. They need to stay in the session so that I can pull the data I
need and update it later, so I can't open another session inside the
existing one.

On Sun, Sep 6, 2015 at 1:40 AM, Matthew Johnson 
wrote:

> I agree with Igor - I would either make sure session is ThreadLocal or,
> more simply, why not create the session at the start of the saveInBatch
> method and close it at the end? Creating a SessionFactory is an expensive
> operation but creating a Session is a relatively cheap one.
> On 6 Sep 2015 07:27, "Igor Berman"  wrote:
>
>> how do you create your session? do you reuse it across threads? how do
>> you create/close session manager?
>> look for the problem in session creation, probably something deadlocked,
>> as far as I remember hib.session should be created per thread
>>
>> On 6 September 2015 at 07:11, Zoran Jeremic 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm developing long running process that should find RSS feeds that all
>>> users in the system have registered to follow, parse these RSS feeds,
>>> extract new entries and store it back to the database as Hibernate
>>> entities, so user can retrieve it. I want to use Apache Spark to enable
>>> parallel processing, since this process might take several hours depending
>>> on the number of users.
>>>
>>> The approach I thought should work was to use
>>> *useridsRDD.foreachPartition*, so I can have separate hibernate session
>>> for each partition. I created Database session manager that is initialized
>>> for each partition which keeps hibernate session alive until the process is
>>> over.
>>>
>>> Once all RSS feeds from one source are parsed and Feed entities are
>>> created, I'm sending the whole list to Database Manager method that saves
>>> the whole list in batch:
>>>
public void saveInBatch(List entities) {
    try {
        boolean isActive = session.getTransaction().isActive();
        if (!isActive) {
            session.beginTransaction();
        }
        for (Object entity : entities) {
            session.save(entity);
        }
        session.getTransaction().commit();
    } catch (Exception ex) {
        if (session.getTransaction() != null) {
            session.getTransaction().rollback();
            ex.printStackTrace();
        }
    }
}

 However, this works only if I have one Spark partition. If there are
>>> two or more partitions, the whole process is blocked once I try to save the
>>> first entity. In order to make the things simpler, I tried to simplify Feed
>>> entity, so it doesn't refer and is not referred from any other entity. It
>>> also doesn't have any collection.
>>>
>>> I hope that some of you have already tried something similar and could
>>> give me idea how to solve this problem
>>>
>>> Thanks,
>>> Zoran
>>>
>>>
>>


-- 
***
Zoran Jeremic, PhD
Senior System Analyst & Programmer

Athabasca University
Tel: +1 604 92 89 944
E-mail: zoran.jere...@gmail.com 
Homepage:  http://zoranjeremic.org
**


Re: SparkContext initialization error- java.io.IOException: No space left on device

2015-09-06 Thread shenyan zhen
Thank you both - yup: the /tmp disk space was filled up:)

On Sun, Sep 6, 2015 at 11:51 AM, Ted Yu  wrote:

> Use the following command if needed:
> df -i /tmp
>
> See
> https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available
>
> On Sun, Sep 6, 2015 at 6:15 AM, Shixiong Zhu  wrote:
>
>> The folder is in "/tmp" by default. Could you use "df -h" to check the
>> free space of /tmp?
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-09-05 9:50 GMT+08:00 shenyan zhen :
>>
>>> Has anyone seen this error? Not sure which dir the program was trying to
>>> write to.
>>>
>>> I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client
>>> mode.
>>>
>>> 15/09/04 21:36:06 ERROR SparkContext: Error adding jar
>>> (java.io.IOException: No space left on device), was the --addJars option
>>> used?
>>>
>>> 15/09/04 21:36:08 ERROR SparkContext: Error initializing SparkContext.
>>>
>>> java.io.IOException: No space left on device
>>>
>>> at java.io.FileOutputStream.writeBytes(Native Method)
>>>
>>> at java.io.FileOutputStream.write(FileOutputStream.java:300)
>>>
>>> at
>>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:178)
>>>
>>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:213)
>>>
>>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:318)
>>>
>>> at
>>> java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:163)
>>>
>>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:338)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:432)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>>>
>>> at
>>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>>>
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>>>
>>> at org.apache.spark.SparkContext.(SparkContext.scala:497)
>>>
>>> Thanks,
>>> Shenyan
>>>
>>
>>
>


Spark - launchng job for each action

2015-09-06 Thread Priya Ch
Hi All,

 In Spark, each action results in launching a job. Let's say my Spark app
looks like this:

val baseRDD =sc.parallelize(Array(1,2,3,4,5),2)
val rdd1 = baseRdd.map(x => x+2)
val rdd2 = rdd1.filter(x => x%2 ==0)
val count = rdd2.count
val firstElement = rdd2.first

println("Count is"+count)
println("First is"+firstElement)

Now, rdd2.count launches  job0 with 1 task and rdd2.first launches job1
with 1 task. Here in job2, when calculating rdd.first, is the entire
lineage computed again or else as job0 already computes rdd2, is it reused
???

Thanks,
Padma Ch


Re: Spark - launchng job for each action

2015-09-06 Thread ayan guha
Hi

"... Here in job2, when calculating rdd.first..."

If you mean rdd2.first, then it uses the rdd2 already computed by
rdd2.count, because it is already available. If some partitions are not
available due to GC, then only those partitions are recomputed.
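
For illustration, a minimal sketch (mine) of the caching variant Jeff suggests
below, using the RDDs from the original question:

  val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
  val rdd2 = baseRDD.map(x => x + 2).filter(x => x % 2 == 0).cache() // mark rdd2 for reuse
  val count = rdd2.count()        // job 0: computes rdd2 and caches its partitions
  val firstElement = rdd2.first() // job 1: served from the cached partitions where available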

On Sun, Sep 6, 2015 at 5:11 PM, Jeff Zhang  wrote:

> If you want to reuse the data, you need to call rdd2.cache
>
>
>
> On Sun, Sep 6, 2015 at 2:33 PM, Priya Ch 
> wrote:
>
>> Hi All,
>>
>>  In Spark, each action results in launching a job. Lets say my spark app
>> looks as-
>>
>> val baseRDD =sc.parallelize(Array(1,2,3,4,5),2)
>> val rdd1 = baseRdd.map(x => x+2)
>> val rdd2 = rdd1.filter(x => x%2 ==0)
>> val count = rdd2.count
>> val firstElement = rdd2.first
>>
>> println("Count is"+count)
>> println("First is"+firstElement)
>>
>> Now, rdd2.count launches  job0 with 1 task and rdd2.first launches job1
>> with 1 task. Here in job2, when calculating rdd.first, is the entire
>> lineage computed again or else as job0 already computes rdd2, is it reused
>> ???
>>
>> Thanks,
>> Padma Ch
>>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards,
Ayan Guha


Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Terry Hole
Hi, Owen,

The dataframe "training" is from an RDD of a case class, RDD[LabeledDocument],
while the case class is defined as this:
case class LabeledDocument(id: Long, text: String, *label: Double*)

So there is already a default "label" column with "double" type.

I already tried to set the label column for decision tree as this:
val lr = new
DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
It raised the same error.

I also tried to change the "label" column to "int" type, but it reported an
error like the following stack; I have no idea how to make this work.

java.lang.IllegalArgumentException: requirement failed: *Column label must
be of type DoubleType but was actually IntegerType*.
at scala.Predef$.require(Predef.scala:233)
at
org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
at
org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at
org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at
org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at
org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
at
org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at
scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
at
org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
at $iwC$$iwC$$iwC$$iwC.(:70)
at $iwC$$iwC$$iwC.(:72)
at $iwC$$iwC.(:74)
at $iwC.(:76)
at (:78)
at .(:82)
at .()
at .(:7)
at .()
at $print()

Thanks!
- Terry

On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen  wrote:

> I think somewhere along the line you've not specified your label
> column -- it's defaulting to "label" and it does not recognize it, or
> at least not as a binary or nominal attribute.
>
> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole  wrote:
> > Hi, Experts,
> >
> > I followed the guide of spark ml pipe to test DecisionTreeClassifier on
> > spark shell with spark 1.4.1, but always meets error like following, do
> you
> > have any idea how to fix this?
> >
> > The error stack:
> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given
> input
> > with invalid label column label, without the number of classes specified.
> > See StringIndexer.
> > at
> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
> > at
> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> > at
> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
> > at
> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > at
> >
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
> > at
> >
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
> > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
> > at
> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> > at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> > at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
> > at $iwC$$iwC$$iwC$$iwC.(:57)
> > at $iwC$$iwC$$iwC.(:59)
> > at $iwC$$iwC.(:61)
> > at $iwC.(:63)
> > at (:65)
> > at .(:69)
> > at .()
> > at .(:7)
> > at .()
> > at $print()
> >
> > The execute code is:
> > // Labeled and unlabeled instance types.
> > // Spark SQL can infer schema from case 

Re: Problem to persist Hibernate entity from Spark job

2015-09-06 Thread Matthew Johnson
I agree with Igor - I would either make sure session is ThreadLocal or,
more simply, why not create the session at the start of the saveInBatch
method and close it at the end? Creating a SessionFactory is an expensive
operation but creating a Session is a relatively cheap one.
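
For illustration, a minimal sketch of that suggestion (my own sketch against the
standard Hibernate API; the sessionFactory value is assumed to be created once
per executor, e.g. lazily in a singleton object):

  def saveInBatch(entities: java.util.List[AnyRef],
                  sessionFactory: org.hibernate.SessionFactory): Unit = {
    val session = sessionFactory.openSession()   // cheap: one short-lived session per call
    try {
      session.beginTransaction()
      val it = entities.iterator()
      while (it.hasNext) {
        session.save(it.next())
      }
      session.getTransaction.commit()
    } catch {
      case ex: Exception =>
        if (session.getTransaction != null) session.getTransaction.rollback()
        ex.printStackTrace()
    } finally {
      session.close()                            // always release the underlying connection
    }
  }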
On 6 Sep 2015 07:27, "Igor Berman"  wrote:

> how do you create your session? do you reuse it across threads? how do you
> create/close session manager?
> look for the problem in session creation, probably something deadlocked,
> as far as I remember hib.session should be created per thread
>
> On 6 September 2015 at 07:11, Zoran Jeremic 
> wrote:
>
>> Hi,
>>
>> I'm developing long running process that should find RSS feeds that all
>> users in the system have registered to follow, parse these RSS feeds,
>> extract new entries and store it back to the database as Hibernate
>> entities, so user can retrieve it. I want to use Apache Spark to enable
>> parallel processing, since this process might take several hours depending
>> on the number of users.
>>
>> The approach I thought should work was to use
>> *useridsRDD.foreachPartition*, so I can have separate hibernate session
>> for each partition. I created Database session manager that is initialized
>> for each partition which keeps hibernate session alive until the process is
>> over.
>>
>> Once all RSS feeds from one source are parsed and Feed entities are
>> created, I'm sending the whole list to Database Manager method that saves
>> the whole list in batch:
>>
>>> public   void saveInBatch(List entities) {
>>> try{
>>>   boolean isActive = session.getTransaction().isActive();
>>> if ( !isActive) {
>>> session.beginTransaction();
>>> }
>>>for(Object entity:entities){
>>>  session.save(entity);
>>> }
>>>session.getTransaction().commit();
>>>  }catch(Exception ex){
>>> if(session.getTransaction()!=null) {
>>> session.getTransaction().rollback();
>>> ex.printStackTrace();
>>>}
>>>   }
>>>
>>> However, this works only if I have one Spark partition. If there are two
>> or more partitions, the whole process is blocked once I try to save the
>> first entity. In order to make the things simpler, I tried to simplify Feed
>> entity, so it doesn't refer and is not referred from any other entity. It
>> also doesn't have any collection.
>>
>> I hope that some of you have already tried something similar and could
>> give me idea how to solve this problem
>>
>> Thanks,
>> Zoran
>>
>>
>


Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Sean Owen
I think somewhere along the line you've not specified your label
column -- it's defaulting to "label" and it does not recognize it, or
at least not as a binary or nominal attribute.

On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole  wrote:
> Hi, Experts,
>
> I followed the guide of spark ml pipe to test DecisionTreeClassifier on
> spark shell with spark 1.4.1, but always meets error like following, do you
> have any idea how to fix this?
>
> The error stack:
> java.lang.IllegalArgumentException: DecisionTreeClassifier was given input
> with invalid label column label, without the number of classes specified.
> See StringIndexer.
> at
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
> at
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
> at
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
> at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
> at $iwC$$iwC$$iwC$$iwC.(:57)
> at $iwC$$iwC$$iwC.(:59)
> at $iwC$$iwC.(:61)
> at $iwC.(:63)
> at (:65)
> at .(:69)
> at .()
> at .(:7)
> at .()
> at $print()
>
> The execute code is:
> // Labeled and unlabeled instance types.
> // Spark SQL can infer schema from case classes.
> case class LabeledDocument(id: Long, text: String, label: Double)
> case class Document(id: Long, text: String)
> // Prepare training documents, which are labeled.
> val training = sc.parallelize(Seq(
>   LabeledDocument(0L, "a b c d e spark", 1.0),
>   LabeledDocument(1L, "b d", 0.0),
>   LabeledDocument(2L, "spark f g h", 1.0),
>   LabeledDocument(3L, "hadoop mapreduce", 0.0)))
>
> // Configure an ML pipeline, which consists of three stages: tokenizer,
> hashingTF, and lr.
> val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
> val hashingTF = new
> HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
> val lr =  new
> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini")
> val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
>
> // Error raises from the following line
> val model = pipeline.fit(training.toDF)
>
>




Re: Problem to persist Hibernate entity from Spark job

2015-09-06 Thread Igor Berman
How do you create your session? Do you reuse it across threads? How do you
create/close the session manager?
Look for the problem in session creation; probably something is deadlocked. As
far as I remember, a Hibernate session should be created per thread.

On 6 September 2015 at 07:11, Zoran Jeremic  wrote:

> Hi,
>
> I'm developing long running process that should find RSS feeds that all
> users in the system have registered to follow, parse these RSS feeds,
> extract new entries and store it back to the database as Hibernate
> entities, so user can retrieve it. I want to use Apache Spark to enable
> parallel processing, since this process might take several hours depending
> on the number of users.
>
> The approach I thought should work was to use
> *useridsRDD.foreachPartition*, so I can have separate hibernate session
> for each partition. I created Database session manager that is initialized
> for each partition which keeps hibernate session alive until the process is
> over.
>
> Once all RSS feeds from one source are parsed and Feed entities are
> created, I'm sending the whole list to Database Manager method that saves
> the whole list in batch:
>
>> public   void saveInBatch(List entities) {
>> try{
>>   boolean isActive = session.getTransaction().isActive();
>> if ( !isActive) {
>> session.beginTransaction();
>> }
>>for(Object entity:entities){
>>  session.save(entity);
>> }
>>session.getTransaction().commit();
>>  }catch(Exception ex){
>> if(session.getTransaction()!=null) {
>> session.getTransaction().rollback();
>> ex.printStackTrace();
>>}
>>   }
>>
>> However, this works only if I have one Spark partition. If there are two
> or more partitions, the whole process is blocked once I try to save the
> first entity. In order to make the things simpler, I tried to simplify Feed
> entity, so it doesn't refer and is not referred from any other entity. It
> also doesn't have any collection.
>
> I hope that some of you have already tried something similar and could
> give me idea how to solve this problem
>
> Thanks,
> Zoran
>
>


Re: SparkContext initialization error- java.io.IOException: No space left on device

2015-09-06 Thread Shixiong Zhu
The folder is in "/tmp" by default. Could you use "df -h" to check the free
space of /tmp?

Best Regards,
Shixiong Zhu

2015-09-05 9:50 GMT+08:00 shenyan zhen :

> Has anyone seen this error? Not sure which dir the program was trying to
> write to.
>
> I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client
> mode.
>
> 15/09/04 21:36:06 ERROR SparkContext: Error adding jar
> (java.io.IOException: No space left on device), was the --addJars option
> used?
>
> 15/09/04 21:36:08 ERROR SparkContext: Error initializing SparkContext.
>
> java.io.IOException: No space left on device
>
> at java.io.FileOutputStream.writeBytes(Native Method)
>
> at java.io.FileOutputStream.write(FileOutputStream.java:300)
>
> at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:178)
>
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:213)
>
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:318)
>
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:163)
>
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:338)
>
> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:432)
>
> at
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
>
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>
> at org.apache.spark.SparkContext.(SparkContext.scala:497)
>
> Thanks,
> Shenyan
>


Re: Spark - launchng job for each action

2015-09-06 Thread Priya Ch
Hi All,

 Thanks for the info. I have one more doubt:
When writing a streaming application, I specify a batch interval. Let's say
the interval is 1 sec: for every 1-sec batch, an RDD is formed and a job is
launched. If there is more than one action specified on an RDD, how many jobs
would it launch?

I mean, every 1-sec batch launches a job, and if there are two actions, are two
more jobs launched internally?
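
For illustration, a minimal sketch (mine, with a made-up socket source) of a
1-second batch whose foreachRDD body contains two actions; each micro-batch then
submits two jobs:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("TwoActionsPerBatch")
  val ssc = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source

  lines.foreachRDD { rdd =>
    val cached = rdd.cache()                                 // avoid recomputing between actions
    val total = cached.count()                               // action 1 -> job 1 for this batch
    val errors = cached.filter(_.contains("ERROR")).count()  // action 2 -> job 2 for this batch
    println(s"total=$total errors=$errors")
    cached.unpersist()
  }

  ssc.start()
  ssc.awaitTermination()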

On Sun, Sep 6, 2015 at 1:15 PM, ayan guha  wrote:

> Hi
>
> "... Here in job2, when calculating rdd.first..."
>
> If you mean if rdd2.first, then it uses rdd2 already computed by
> rdd2.count, because it is already available. If some partitions are not
> available due to GC, then only those partitions are recomputed.
>
> On Sun, Sep 6, 2015 at 5:11 PM, Jeff Zhang  wrote:
>
>> If you want to reuse the data, you need to call rdd2.cache
>>
>>
>>
>> On Sun, Sep 6, 2015 at 2:33 PM, Priya Ch 
>> wrote:
>>
>>> Hi All,
>>>
>>>  In Spark, each action results in launching a job. Lets say my spark app
>>> looks as-
>>>
>>> val baseRDD =sc.parallelize(Array(1,2,3,4,5),2)
>>> val rdd1 = baseRdd.map(x => x+2)
>>> val rdd2 = rdd1.filter(x => x%2 ==0)
>>> val count = rdd2.count
>>> val firstElement = rdd2.first
>>>
>>> println("Count is"+count)
>>> println("First is"+firstElement)
>>>
>>> Now, rdd2.count launches  job0 with 1 task and rdd2.first launches job1
>>> with 1 task. Here in job2, when calculating rdd.first, is the entire
>>> lineage computed again or else as job0 already computes rdd2, is it reused
>>> ???
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Sean Owen
(Sean)
The error suggests that the type is not a binary or nominal attribute
though. I think that's the missing step. A double-valued column need
not be one of these attribute types.
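
For reference, a hedged sketch of the StringIndexer route the error message
points at, reusing the column names from the original example ("indexedLabel"
is just an illustrative name; depending on the Spark version, StringIndexer may
require casting the label to a string column first):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}

  // Index the label so the output column carries the number of classes as metadata.
  val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setNumFeatures(1000)
    .setInputCol(tokenizer.getOutputCol).setOutputCol("features")
  val dt = new DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32)
    .setImpurity("gini").setLabelCol("indexedLabel").setFeaturesCol("features")
  val pipeline = new Pipeline().setStages(Array(labelIndexer, tokenizer, hashingTF, dt))
  val model = pipeline.fit(training.toDF)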

On Sun, Sep 6, 2015 at 10:14 AM, Terry Hole  wrote:
> Hi, Owen,
>
> The dataframe "training" is from a RDD of case class: RDD[LabeledDocument],
> while the case class is defined as this:
> case class LabeledDocument(id: Long, text: String, label: Double)
>
> So there is already has the default "label" column with "double" type.
>
> I already tried to set the label column for decision tree as this:
> val lr = new
> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
> It raised the same error.
>
> I also tried to change the "label" to "int" type, it also reported error
> like following stack, I have no idea how to make this work.
>
> java.lang.IllegalArgumentException: requirement failed: Column label must be
> of type DoubleType but was actually IntegerType.
> at scala.Predef$.require(Predef.scala:233)
> at
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
> at
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
> at
> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
> at
> org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
> at
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
> at
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
> at
> scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
> at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
> at
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
> at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
> at $iwC$$iwC$$iwC$$iwC.(:70)
> at $iwC$$iwC$$iwC.(:72)
> at $iwC$$iwC.(:74)
> at $iwC.(:76)
> at (:78)
> at .(:82)
> at .()
> at .(:7)
> at .()
> at $print()
>
> Thanks!
> - Terry
>
> On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen  wrote:
>>
>> I think somewhere along the line you've not specified your label
>> column -- it's defaulting to "label" and it does not recognize it, or
>> at least not as a binary or nominal attribute.
>>
>> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole  wrote:
>> > Hi, Experts,
>> >
>> > I followed the guide of spark ml pipe to test DecisionTreeClassifier on
>> > spark shell with spark 1.4.1, but always meets error like following, do
>> > you
>> > have any idea how to fix this?
>> >
>> > The error stack:
>> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given
>> > input
>> > with invalid label column label, without the number of classes
>> > specified.
>> > See StringIndexer.
>> > at
>> >
>> > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
>> > at
>> >
>> > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
>> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>> > at
>> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
>> > at
>> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > at
>> > scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> > at
>> >
>> > scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>> > at
>> >
>> > scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>> > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
>> > at
>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
>> > at
>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>> > at 

Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Terry Hole
Sean,

Do you know how to tell the decision tree that the "label" is binary, or how to
set attributes on the dataframe to carry the number of classes?

Thanks!
- Terry

On Sun, Sep 6, 2015 at 5:23 PM, Sean Owen  wrote:

> (Sean)
> The error suggests that the type is not a binary or nominal attribute
> though. I think that's the missing step. A double-valued column need
> not be one of these attribute types.
>
> On Sun, Sep 6, 2015 at 10:14 AM, Terry Hole  wrote:
> > Hi, Owen,
> >
> > The dataframe "training" is from a RDD of case class:
> RDD[LabeledDocument],
> > while the case class is defined as this:
> > case class LabeledDocument(id: Long, text: String, label: Double)
> >
> > So there is already has the default "label" column with "double" type.
> >
> > I already tried to set the label column for decision tree as this:
> > val lr = new
> >
> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
> > It raised the same error.
> >
> > I also tried to change the "label" to "int" type, it also reported error
> > like following stack, I have no idea how to make this work.
> >
> > java.lang.IllegalArgumentException: requirement failed: Column label
> must be
> > of type DoubleType but was actually IntegerType.
> > at scala.Predef$.require(Predef.scala:233)
> > at
> >
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
> > at
> >
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
> > at
> >
> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
> > at
> > org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
> > at
> >
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
> > at
> >
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
> > at
> >
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> > at
> >
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
> > at
> > scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
> > at
> org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
> > at
> > org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
> > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
> > at
> >
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> > at
> >
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
> > at
> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
> > at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
> > at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
> > at $iwC$$iwC$$iwC$$iwC.(:70)
> > at $iwC$$iwC$$iwC.(:72)
> > at $iwC$$iwC.(:74)
> > at $iwC.(:76)
> > at (:78)
> > at .(:82)
> > at .()
> > at .(:7)
> > at .()
> > at $print()
> >
> > Thanks!
> > - Terry
> >
> > On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen  wrote:
> >>
> >> I think somewhere along the line you've not specified your label
> >> column -- it's defaulting to "label" and it does not recognize it, or
> >> at least not as a binary or nominal attribute.
> >>
> >> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole 
> wrote:
> >> > Hi, Experts,
> >> >
> >> > I followed the guide of spark ml pipe to test DecisionTreeClassifier
> on
> >> > spark shell with spark 1.4.1, but always meets error like following,
> do
> >> > you
> >> > have any idea how to fix this?
> >> >
> >> > The error stack:
> >> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given
> >> > input
> >> > with invalid label column label, without the number of classes
> >> > specified.
> >> > See StringIndexer.
> >> > at
> >> >
> >> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
> >> > at
> >> >
> >> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
> >> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> >> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> >> > at
> >> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
> >> > at
> >> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
> >> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >> > at
> >> > scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >> > at
> >> >
> >> >
> 

udaf with multiple return values in spark 1.5.0

2015-09-06 Thread Simon Hafner
Hi everyone

Is it possible to return multiple values from a UDAF defined in Spark
1.5.0? The documentation [1] mentions

abstract def dataType: DataType
The DataType of the returned value of this UserDefinedAggregateFunction.

so it's only possible to return a single value. Should I use ArrayType
as a workaround here? The returned values are all doubles.

Cheers

[1] 
https://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction
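
For what it's worth, a minimal sketch of the ArrayType workaround against the
Spark 1.5 UDAF API (my own sketch; the class name and columns are made up, and a
StructType return type may be an alternative worth trying):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
  import org.apache.spark.sql.types._

  // Aggregates a double column and returns two doubles (sum and count) as one array.
  class SumAndCount extends UserDefinedAggregateFunction {
    def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
    def bufferSchema: StructType =
      StructType(StructField("sum", DoubleType) :: StructField("count", DoubleType) :: Nil)
    def dataType: DataType = ArrayType(DoubleType)  // the "multiple values" packed together
    def deterministic: Boolean = true
    def initialize(buffer: MutableAggregationBuffer): Unit = {
      buffer(0) = 0.0
      buffer(1) = 0.0
    }
    def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
      if (!input.isNullAt(0)) {
        buffer(0) = buffer.getDouble(0) + input.getDouble(0)
        buffer(1) = buffer.getDouble(1) + 1.0
      }
    }
    def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
      buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
      buffer1(1) = buffer1.getDouble(1) + buffer2.getDouble(1)
    }
    def evaluate(buffer: Row): Any = Seq(buffer.getDouble(0), buffer.getDouble(1))
  }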




Re: SparkContext initialization error- java.io.IOException: No space left on device

2015-09-06 Thread Ted Yu
Use the following command if needed:
df -i /tmp

See
https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available

On Sun, Sep 6, 2015 at 6:15 AM, Shixiong Zhu  wrote:

> The folder is in "/tmp" by default. Could you use "df -h" to check the
> free space of /tmp?
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-05 9:50 GMT+08:00 shenyan zhen :
>
>> Has anyone seen this error? Not sure which dir the program was trying to
>> write to.
>>
>> I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client
>> mode.
>>
>> 15/09/04 21:36:06 ERROR SparkContext: Error adding jar
>> (java.io.IOException: No space left on device), was the --addJars option
>> used?
>>
>> 15/09/04 21:36:08 ERROR SparkContext: Error initializing SparkContext.
>>
>> java.io.IOException: No space left on device
>>
>> at java.io.FileOutputStream.writeBytes(Native Method)
>>
>> at java.io.FileOutputStream.write(FileOutputStream.java:300)
>>
>> at
>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:178)
>>
>> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:213)
>>
>> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:318)
>>
>> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:163)
>>
>> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:338)
>>
>> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:432)
>>
>> at
>> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
>>
>> at
>> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>>
>> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>>
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>>
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>>
>> at org.apache.spark.SparkContext.(SparkContext.scala:497)
>>
>> Thanks,
>> Shenyan
>>
>
>


Re: ClassCastException in driver program

2015-09-06 Thread Shixiong Zhu
It looks like there are some circular references in SQL making the immutable
List serialization fail in Scala 2.11.

In 2.11, Scala immutable List uses writeReplace()/readResolve() which don't
play nicely with circular references. Here is an example to reproduce this
issue in 2.11.6:

  class Foo extends Serializable {
    var l: Seq[Any] = null
  }

  import java.io._

  val o = new ByteArrayOutputStream()
  val o1 = new ObjectOutputStream(o)
  val m = new Foo
  val n = List(1, m)
  m.l = n
  o1.writeObject(n)
  o1.close()
  val i = new ByteArrayInputStream(o.toByteArray)
  val i1 = new ObjectInputStream(i)
  i1.readObject()

Could you provide the "explain" output? It would be helpful to find the
circular references.



Best Regards,
Shixiong Zhu

2015-09-05 0:26 GMT+08:00 Jeff Jones :

> We are using Scala 2.11 for a driver program that is running Spark SQL
> queries in a standalone cluster. I’ve rebuilt Spark for Scala 2.11 using
> the instructions at
> http://spark.apache.org/docs/latest/building-spark.html.  I’ve had to
> work through a few dependency conflict but all-in-all it seems to work for
> some simple Spark examples. I integrated the Spark SQL code into my
> application and I’m able to run using a local client, but when I switch
> over to the standalone cluster I get the following error.  Any help
> tracking this down would be appreciated.
>
> This exception occurs during a DataFrame.collect() call. I’ve tried to use
> –Dsun.io.serialization.extendedDebugInfo=true to get more information but
> it didn’t provide anything more.
>
> [error] o.a.s.s.TaskSetManager - Task 0 in stage 1.0 failed 4 times;
> aborting job
>
> [error] c.a.i.c.Analyzer - Job aborted due to stage failure: Task 0 in
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
> (TID 4, 10.248.0.242): java.lang.ClassCastException: cannot assign instance
> of scala.collection.immutable.List$SerializationProxy to field
> org.apache.spark.sql.execution.Project.projectList of type
> scala.collection.Seq in instance of org.apache.spark.sql.execution.Project
>
> at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(Unknown
> Source)
>
> at java.io.ObjectStreamClass.setObjFieldValues(Unknown Source)
>
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.readObject(Unknown Source)
>
> at
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>
> at java.lang.reflect.Method.invoke(Unknown Source)
>
> at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
>
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
>
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
>
> at java.io.ObjectInputStream.readObject0(Unknown Source)
>
> at java.io.ObjectInputStream.readObject(Unknown Source)
>
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)

Re: Problem with repartition/OOM

2015-09-06 Thread Yana Kadiyska
Thanks Yanbo,

I was running with 1G per executor; my file is 7.5G, read with the
standard block size of 128M, which naturally results in roughly 7500M/128M = 59
partitions. My boxes have 8 CPUs, so I figured they could be processing 8
tasks/partitions at a time, needing

8 * (partition size) of memory per executor, i.e. 8 * 128M = 1G.

Is this the right way to do this math?

I'm confused about _decreasing_ the number of partitions -- I thought from
a spark perspective, 7.5G / 10 partitions would result in 750M per
partition. So a Spark executor with 8 cores would potentially need
750*8=6000M of memory

Maybe my confusion comes from terminology -- I thought that in Spark the
"default" number of partitions is always the number of input splits. From your
example:

  (number of partitions) * (Parquet block size) = minimum required memory

yet from my understanding this would also be the overall Parquet file size,
since (number of partitions) = (file size) / (Parquet block size).

It cannot be that minimum required memory = Parquet file size.


On Sat, Sep 5, 2015 at 11:00 PM, Yanbo Liang  wrote:

> The Parquet output writer allocates one block for each table partition it
> is processing and writes partitions in parallel. It will run out of
> memory if (number of partitions) times (Parquet block size) is greater than
> the available memory. You can try to decrease the number of partitions. And
> could you share the value of "parquet.block.size" and your available memory?
>
> 2015-09-05 18:59 GMT+08:00 Yana Kadiyska :
>
>> Hi folks, I have a strange issue. Trying to read a 7G file and do failry
>> simple stuff with it:
>>
>> I can read the file/do simple operations on it. However, I'd prefer to
>> increase the number of partitions in preparation for more memory-intensive
>> operations (I'm happy to wait, I just need the job to complete).
>> Repartition seems to cause an OOM for me?
>> Could someone shed light/or speculate/ why this would happen -- I thought
>> we repartition higher to relieve memory pressure?
>>
>> Im using Spark1.4.1 CDH4 if that makes a difference
>>
>> This works
>>
>> val res2 = sqlContext.parquetFile(lst:_*).where($"customer_id"===lit(254))
>> res2.count
>> res1: Long = 77885925
>>
>> scala> res2.explain
>> == Physical Plan ==
>> Filter (customer_id#314 = 254)
>>  PhysicalRDD [4], MapPartitionsRDD[11] at
>>
>> scala> res2.rdd.partitions.size
>> res3: Int = 59
>>
>> ​
>>
>>
>> This doesnt:
>>
>> scala> res2.repartition(60).count
>> [Stage 2:>(1 + 45) / 
>> 59]15/09/05 10:17:21 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 
>> 62, fqdn): java.lang.OutOfMemoryError: Java heap space
>> at 
>> parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:729)
>> at 
>> parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:490)
>> at 
>> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:116)
>> at 
>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>> at 
>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>> at 
>> org.apache.spark.sql.sources.SqlNewHadoopRDD$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
>> at 
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$anon$14.hasNext(Iterator.scala:388)
>> at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
>> at 
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:207)
>> at 
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> ​
>>
>
>