Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
I think it does, because users don't see their application logic and flow
exactly the way Spark's internals do. Of course we follow general guidelines
for performance, but we really shouldn't have to care how exactly Spark decides
to execute the DAG; the Spark scheduler or core can keep changing over time to
optimize it. So optimizing from the user's perspective means looking at which
transformations they are using and what they are doing inside those
transformations. If users had some transparency from the framework on how those
transformations utilize resources over time, or where they fail, we could
optimize better. That way we stay focused on our application logic rather than
on what the framework is doing underneath.

About a solution: doesn't the Spark driver (SparkContext + event listener) have
knowledge of every job, task set, and task, and their current state? The Spark
UI can relate a job to its stages and tasks, so why not a stage to its
transformations?
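(Editorial aside: a minimal Scala sketch of what the driver already exposes
through the listener API today. This is not an existing UI feature; the stage
name and call-site details are simply the closest available proxies for a
transformation.)

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    // info.name looks like "collect at MyJob.scala:42"; info.details carries a
    // longer call-site stack for the stage.
    info.failureReason.foreach { reason =>
      println(s"Stage ${info.stageId} (${info.name}) failed: $reason")
    }
  }
})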

Again, my real point is to assess this as a requirement from the users' and
stakeholders' perspective, regardless of the technical challenges.

Thanks
Nirav

On Wed, May 25, 2016 at 8:04 PM, Mark Hamstra 
wrote:

> But when you talk about optimizing the DAG, it really doesn't make sense
> to also talk about transformation steps as separate entities.  The
> DAGScheduler knows about Jobs, Stages, TaskSets and Tasks.  The
> TaskScheduler knows about TaskSets and Tasks.  Neither of them understands
> the transformation steps that you used to define your RDD -- at least not
> as separable, distinct steps.  To give the kind of
> transformation-step-oriented information that you want would require parts
> of Spark that don't currently concern themselves at all with RDD
> transformation steps to start tracking them and how they map to Jobs,
> Stages, TaskSets and Tasks -- and when you start talking about Datasets and
> Spark SQL, you then need to start talking about tracking and mapping
> concepts like Plans, Schemas and Queries.  It would introduce significant
> new complexity.
>
> On Wed, May 25, 2016 at 6:59 PM, Nirav Patel 
> wrote:
>
>> Hi Mark,
>>
>> I might have said stage instead of step in my last statement "UI just
>> says Collect failed but in fact it could be any stage in that lazy chain of
>> evaluation."
>>
>> Anyways even you agree that this visibility of underlaying steps wont't
>> be available. which does pose difficulties in terms of troubleshooting as
>> well as optimizations at step level. I think users will have hard time
>> without this. Its great that spark community working on different levels of
>> internal optimizations but its also important to give enough visibility
>> to users to enable them to debug issues and resolve bottleneck.
>> There is also no visibility into how spark utilizes shuffle memory space
>> vs user memory space vs cache space. It's a separate topic though. If
>> everything is working magically as a black box then it's fine but when you
>> have large number of people on this site complaining about  OOM and shuffle
>> error all the time you need to start providing some transparency to
>> address that.
>>
>> Thanks
>>
>>
>> On Wed, May 25, 2016 at 6:41 PM, Mark Hamstra 
>> wrote:
>>
>>> You appear to be misunderstanding the nature of a Stage.  Individual
>>> transformation steps such as `map` do not define the boundaries of Stages.
>>> Rather, a sequence of transformations in which there is only a
>>> NarrowDependency between each of the transformations will be pipelined into
>>> a single Stage.  It is only when there is a ShuffleDependency that a new
>>> Stage will be defined -- i.e. shuffle boundaries define Stage boundaries.
>>> With whole stage code gen in Spark 2.0, there will be even less opportunity
>>> to treat individual transformations within a sequence of narrow
>>> dependencies as though they were discrete, separable entities.  The Failed
>>> Stages portion of the Web UI will tell you which Stage in a Job failed, and
>>> the accompanying error log message will generally also give you some idea
>>> of which Tasks failed and why.  Tracing the error back further and at a
>>> different level of abstraction to lay blame on a particular transformation
>>> wouldn't be particularly easy.
>>>
>>> On Wed, May 25, 2016 at 5:28 PM, Nirav Patel 
>>> wrote:
>>>
 It's great that spark scheduler does optimized DAG processing and only
 does lazy eval when some action is performed or shuffle dependency is
 encountered. Sometime it goes further after shuffle dep before executing
 anything. e.g. if there are map steps after shuffle then it doesn't stop at
 shuffle to execute anything but goes to that next map steps until it finds
 a reason(spark action) to execute. As a result stage that spark is running
 can be internally series of (map -> shuffle -> map -> map -> collect) and
 spark UI just shows its currently running 'collect' stage. SO  if job fails
 at that point 

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Takeshi Yamamuro
Hi,

How about this?
--
val func = udf((i: Int) => Tuple2(i, i))
val df = Seq((1, 0), (2, 5)).toDF("a", "b")
df.select(func($"a").as("r")).select($"r._1", $"r._2")

// maropu
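(Editorial aside: a small variant of the same idea, in case the tuple fields
should end up under their own column names; assumes the spark-shell implicits
for $ and toDF are in scope.)

import org.apache.spark.sql.functions.udf

val func = udf((i: Int) => (i, i))
val df = Seq((1, 0), (2, 5)).toDF("a", "b")
// a single call to the udf, then rename the struct fields in one select
val result = df.select($"b", func($"a").as("r"))
  .select($"b", $"r._1".as("x"), $"r._2".as("y"))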


On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers  wrote:

> hello all,
>
> i have a single udf that creates 2 outputs (so a tuple 2). i would like to
> add these 2 columns to my dataframe.
>
> my current solution is along these lines:
> df
>   .withColumn("_temp_", udf(inputColumns))
>   .withColumn("x", col("_temp_)("_1"))
>   .withColumn("y", col("_temp_")("_2"))
>   .drop("_temp_")
>
> this works, but its not pretty with the temporary field stuff.
>
> i also tried this:
> val tmp = udf(inputColumns)
> df
>   .withColumn("x", tmp("_1"))
>   .withColumn("y", tmp("_2"))
>
> this also works, but unfortunately the udf is evaluated twice
>
> is there a better way to do this?
>
> thanks! koert
>



-- 
---
Takeshi Yamamuro


Re: Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-05-25 Thread Selvam Raman
XGBoost4J can integrate with Spark from version 1.6 onwards.

Currently I am using Spark 1.5.2. Can I use XGBoost instead of XGBoost4J?

Will both provide the same result?

Thanks,
Selvam R
+91-97877-87724
On Mar 15, 2016 9:23 PM, "Nan Zhu"  wrote:

> Dear Spark Users and Developers,
>
> We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are
> happy to announce the release of XGBoost4J (
> http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html),
> a Portable Distributed XGBoost in Spark, Flink and Dataflow
>
> XGBoost is an optimized distributed gradient boosting library designed to
> be highly *efficient*, *flexible* and *portable*. XGBoost provides a
> parallel tree boosting (also known as GBDT, GBM) that solves many data
> science problems in a fast and accurate way. It has been the winning
> solution for many machine learning scenarios, ranging from Machine Learning
> Challenges (
> https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions)
> to Industrial User Cases (
> https://github.com/dmlc/xgboost/tree/master/demo#usecases)
>
> *XGBoost4J* is a new package in XGBoost aiming to provide the clean
> Scala/Java APIs and the seamless integration with the mainstream data
> processing platform, like Apache Spark. With XGBoost4J, users can run
> XGBoost as a stage of Spark job and build a unified pipeline from ETL to
> Model training to data product service within Spark, instead of jumping
> across two different systems, i.e. XGBoost and Spark. (Example:
> https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/DistTrainWithSpark.scala
> )
>
> Today, we release the first version of XGBoost4J to bring more choices to
> the Spark users who are seeking the solutions to build highly efficient
> data analytic platform and enrich the Spark ecosystem. We will keep moving
> forward to integrate with more features of Spark. Of course, you are more
> than welcome to join us and contribute to the project!
>
> For more details of distributed XGBoost, you can refer to the
> recently published paper: http://arxiv.org/abs/1603.02754
>
> Best,
>
> --
> Nan Zhu
> http://codingcat.me
>


Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
But when you talk about optimizing the DAG, it really doesn't make sense to
also talk about transformation steps as separate entities.  The
DAGScheduler knows about Jobs, Stages, TaskSets and Tasks.  The
TaskScheduler knows about TaskSets and Tasks.  Neither of them understands
the transformation steps that you used to define your RDD -- at least not
as separable, distinct steps.  To give the kind of
transformation-step-oriented information that you want would require parts
of Spark that don't currently concern themselves at all with RDD
transformation steps to start tracking them and how they map to Jobs,
Stages, TaskSets and Tasks -- and when you start talking about Datasets and
Spark SQL, you then need to start talking about tracking and mapping
concepts like Plans, Schemas and Queries.  It would introduce significant
new complexity.

On Wed, May 25, 2016 at 6:59 PM, Nirav Patel  wrote:

> Hi Mark,
>
> I might have said stage instead of step in my last statement "UI just
> says Collect failed but in fact it could be any stage in that lazy chain of
> evaluation."
>
> Anyways even you agree that this visibility of underlaying steps wont't be
> available. which does pose difficulties in terms of troubleshooting as well
> as optimizations at step level. I think users will have hard time without
> this. Its great that spark community working on different levels of
> internal optimizations but its also important to give enough visibility
> to users to enable them to debug issues and resolve bottleneck.
> There is also no visibility into how spark utilizes shuffle memory space
> vs user memory space vs cache space. It's a separate topic though. If
> everything is working magically as a black box then it's fine but when you
> have large number of people on this site complaining about  OOM and shuffle
> error all the time you need to start providing some transparency to
> address that.
>
> Thanks
>
>
> On Wed, May 25, 2016 at 6:41 PM, Mark Hamstra 
> wrote:
>
>> You appear to be misunderstanding the nature of a Stage.  Individual
>> transformation steps such as `map` do not define the boundaries of Stages.
>> Rather, a sequence of transformations in which there is only a
>> NarrowDependency between each of the transformations will be pipelined into
>> a single Stage.  It is only when there is a ShuffleDependency that a new
>> Stage will be defined -- i.e. shuffle boundaries define Stage boundaries.
>> With whole stage code gen in Spark 2.0, there will be even less opportunity
>> to treat individual transformations within a sequence of narrow
>> dependencies as though they were discrete, separable entities.  The Failed
>> Stages portion of the Web UI will tell you which Stage in a Job failed, and
>> the accompanying error log message will generally also give you some idea
>> of which Tasks failed and why.  Tracing the error back further and at a
>> different level of abstraction to lay blame on a particular transformation
>> wouldn't be particularly easy.
>>
>> On Wed, May 25, 2016 at 5:28 PM, Nirav Patel 
>> wrote:
>>
>>> It's great that spark scheduler does optimized DAG processing and only
>>> does lazy eval when some action is performed or shuffle dependency is
>>> encountered. Sometime it goes further after shuffle dep before executing
>>> anything. e.g. if there are map steps after shuffle then it doesn't stop at
>>> shuffle to execute anything but goes to that next map steps until it finds
>>> a reason(spark action) to execute. As a result stage that spark is running
>>> can be internally series of (map -> shuffle -> map -> map -> collect) and
>>> spark UI just shows its currently running 'collect' stage. SO  if job fails
>>> at that point spark UI just says Collect failed but in fact it could be any
>>> stage in that lazy chain of evaluation. Looking at executor logs gives some
>>> insights but that's not always straightforward.
>>> Correct me if I am wrong here but I think we need more visibility into
>>> what's happening underneath so we can easily troubleshoot as well as
>>> optimize our DAG.
>>>
>>> THanks
>>>
>>>
>>>
>>
>>
>>
>
>
>
>


Kafka connection logs in Spark

2016-05-25 Thread Mail.com
Hi All,

I am connecting Spark 1.6 Streaming to Kafka 0.8.2 with Kerberos. I ran Spark
Streaming in debug mode, but I do not see any log saying that it connected to
Kafka or to a topic, etc. How can I enable that?

My Spark Streaming job runs, but no messages are fetched from the RDD. Please
suggest.

Thanks,
Pradeep

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
Hi Mark,

I might have said stage instead of step in my last statement "UI just says
Collect failed but in fact it could be any stage in that lazy chain of
evaluation."

Anyway, even you agree that this visibility into the underlying steps won't be
available, which does pose difficulties for troubleshooting as well as for
optimization at the step level. I think users will have a hard time without it.
It's great that the Spark community is working on different levels of internal
optimization, but it's also important to give users enough visibility to debug
issues and resolve bottlenecks.
There is also no visibility into how Spark divides shuffle memory space vs.
user memory space vs. cache space. That's a separate topic, though. If
everything worked magically as a black box it would be fine, but when you have
a large number of people on this list complaining about OOM and shuffle errors
all the time, you need to start providing some transparency to address that.
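(Editorial aside: the configuration knobs that govern the split mentioned above
under the Spark 1.6 unified memory manager; the values shown are just the
defaults.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.75")        // heap share for execution + storage
  .set("spark.memory.storageFraction", "0.5")  // part of that protected for cached blocks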

Thanks


On Wed, May 25, 2016 at 6:41 PM, Mark Hamstra 
wrote:

> You appear to be misunderstanding the nature of a Stage.  Individual
> transformation steps such as `map` do not define the boundaries of Stages.
> Rather, a sequence of transformations in which there is only a
> NarrowDependency between each of the transformations will be pipelined into
> a single Stage.  It is only when there is a ShuffleDependency that a new
> Stage will be defined -- i.e. shuffle boundaries define Stage boundaries.
> With whole stage code gen in Spark 2.0, there will be even less opportunity
> to treat individual transformations within a sequence of narrow
> dependencies as though they were discrete, separable entities.  The Failed
> Stages portion of the Web UI will tell you which Stage in a Job failed, and
> the accompanying error log message will generally also give you some idea
> of which Tasks failed and why.  Tracing the error back further and at a
> different level of abstraction to lay blame on a particular transformation
> wouldn't be particularly easy.
>
> On Wed, May 25, 2016 at 5:28 PM, Nirav Patel 
> wrote:
>
>> It's great that spark scheduler does optimized DAG processing and only
>> does lazy eval when some action is performed or shuffle dependency is
>> encountered. Sometime it goes further after shuffle dep before executing
>> anything. e.g. if there are map steps after shuffle then it doesn't stop at
>> shuffle to execute anything but goes to that next map steps until it finds
>> a reason(spark action) to execute. As a result stage that spark is running
>> can be internally series of (map -> shuffle -> map -> map -> collect) and
>> spark UI just shows its currently running 'collect' stage. SO  if job fails
>> at that point spark UI just says Collect failed but in fact it could be any
>> stage in that lazy chain of evaluation. Looking at executor logs gives some
>> insights but that's not always straightforward.
>> Correct me if I am wrong here but I think we need more visibility into
>> what's happening underneath so we can easily troubleshoot as well as
>> optimize our DAG.
>>
>> THanks
>>
>>
>>
>
>
>

-- 




Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
You appear to be misunderstanding the nature of a Stage.  Individual
transformation steps such as `map` do not define the boundaries of Stages.
Rather, a sequence of transformations in which there is only a
NarrowDependency between each of the transformations will be pipelined into
a single Stage.  It is only when there is a ShuffleDependency that a new
Stage will be defined -- i.e. shuffle boundaries define Stage boundaries.
With whole stage code gen in Spark 2.0, there will be even less opportunity
to treat individual transformations within a sequence of narrow
dependencies as though they were discrete, separable entities.  The Failed
Stages portion of the Web UI will tell you which Stage in a Job failed, and
the accompanying error log message will generally also give you some idea
of which Tasks failed and why.  Tracing the error back further and at a
different level of abstraction to lay blame on a particular transformation
wouldn't be particularly easy.
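(Editorial illustration of the pipelining described above; the input path is
hypothetical. rdd.toDebugString prints the stage structure the DAGScheduler
will build.)

val counts = sc.textFile("input.txt")
  .map(_.toLowerCase)            // NarrowDependency: pipelined
  .filter(_.nonEmpty)            // NarrowDependency: same Stage
  .map(line => (line, 1))
  .reduceByKey(_ + _)            // ShuffleDependency: a new Stage starts here

// the indentation in the debug string marks the shuffle/Stage boundary
println(counts.toDebugString)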

On Wed, May 25, 2016 at 5:28 PM, Nirav Patel  wrote:

> It's great that spark scheduler does optimized DAG processing and only
> does lazy eval when some action is performed or shuffle dependency is
> encountered. Sometime it goes further after shuffle dep before executing
> anything. e.g. if there are map steps after shuffle then it doesn't stop at
> shuffle to execute anything but goes to that next map steps until it finds
> a reason(spark action) to execute. As a result stage that spark is running
> can be internally series of (map -> shuffle -> map -> map -> collect) and
> spark UI just shows its currently running 'collect' stage. SO  if job fails
> at that point spark UI just says Collect failed but in fact it could be any
> stage in that lazy chain of evaluation. Looking at executor logs gives some
> insights but that's not always straightforward.
> Correct me if I am wrong here but I think we need more visibility into
> what's happening underneath so we can easily troubleshoot as well as
> optimize our DAG.
>
> THanks
>
>
>


User impersonation with Kerberos and Delegation tokens

2016-05-25 Thread Sudarshan Rangarajan
Hi there,

I'm using SparkLauncher API from Spark v1.6.1, to submit a Spark job to
YARN. The service from where this API will be invoked will need to talk to
other services on our cluster via Kerberos (ex. HDFS, YARN etc.). Also, my
service expects to impersonate its then logged-in user during job
submission to YARN.

Can someone comment on/confirm the following two observations?

1. From this Github PR
,
it seems that SparkSubmit doesn't support user impersonation when Kerberos
credentials of the real user are specified.

2. SPARK-14743  - It
seems that Spark doesn't yet support acquisition of delegation tokens
during job submission to YARN irrespective of user impersonation. This Github
discussion  is also relevant to
this scenario.

Thanks,
Sudarshan


Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
It's great that the Spark scheduler does optimized DAG processing and only
evaluates lazily when some action is performed or a shuffle dependency is
encountered. Sometimes it goes even further past a shuffle dependency before
executing anything; e.g. if there are map steps after a shuffle, it doesn't
stop at the shuffle to execute anything but continues to those next map steps
until it finds a reason (a Spark action) to execute. As a result, the stage
Spark is running can internally be a series of steps (map -> shuffle -> map ->
map -> collect) while the Spark UI just shows that it is currently running the
'collect' stage. So if the job fails at that point, the Spark UI just says
collect failed, but in fact it could be any stage in that lazy chain of
evaluation. Looking at executor logs gives some insight, but that's not always
straightforward.
Correct me if I am wrong here, but I think we need more visibility into what's
happening underneath so we can easily troubleshoot as well as optimize our DAG.

Thanks

-- 




Re: never understand

2016-05-25 Thread Andrew Ehrlich
- Try doing less in each transformation
- Try using different data structures within the transformations
- Try not caching anything to free up more memory


On Wed, May 25, 2016 at 1:32 AM, pseudo oduesp 
wrote:

> hi guys,
> - I get these errors with PySpark 1.5.0 under Cloudera CDH 5.5 (YARN).
>
> - I use YARN to deploy the job on the cluster.
> - I use HiveContext and Parquet files to save my data.
> - The container limit is 16 GB.
> - The executor memory I tested before is 12 GB.
> - I tried increasing the number of partitions (200 by default), multiplying
> it by 2 and by 3, without success.
>
> - I also tried changing the number of SQL shuffle partitions.
>
> - I notice in the Spark UI that when shuffle write is triggered there is no
> problem, but when shuffle read is triggered I lose executors and get errors.
>
> I am really blocked by this error and can't tell where it comes from:
>
>
>
>
>  ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread
> Thread[Executor task launch worker-5,5,main]
> java.lang.OutOfMemoryError: Java heap space
> at
> parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
> at parquet.column.values.dictionary.IntList.(IntList.java:86)
> at
> parquet.column.values.dictionary.DictionaryValuesWriter.(DictionaryValuesWriter.java:93)
> at
> parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.(DictionaryValuesWriter.java:229)
> at
> parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:131)
> at
> parquet.column.ParquetProperties.dictWriterWithFallBack(ParquetProperties.java:178)
> at
> parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:203)
> at
> parquet.column.impl.ColumnWriterV1.(ColumnWriterV1.java:84)
> at
> parquet.column.impl.ColumnWriteStoreV1.newMemColumn(ColumnWriteStoreV1.java:68)
> at
> parquet.column.impl.ColumnWriteStoreV1.getColumnWriter(ColumnWriteStoreV1.java:56)
> at
> parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:207)
> at
> parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:405)
> at
> parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
> at
> parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:97)
> at
> parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:100)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:326)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
>  at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:405)
> at
> parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
> at
> parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:97)
> at
> parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:100)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:326)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 16/05/25 09:54:42 ERROR util.SparkUncaughtExceptionHandler: Uncaught
> exception in thread Thread[Executor task launch worker-6,5,main]
> java.lang.OutOfMemoryError: Java heap space
> at
> parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
>   

Re: Preference and confidence in ALS implicit preferences output?

2016-05-25 Thread Sean Owen
This isn't specific to Spark, but there is no direct relation. An
input preference is a count-like value, which is converted into a
confidence value via the 1 + alpha*value formula. But the matrix that
is factored is the 0/1 matrix mentioned in the paper, and the
resulting 'prediction' values are elements of the product of the factored
matrices. So a prediction is a value that's mostly in [0,1] but not always,
and it is not a preference or a confidence.
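(For reference, an editorial summary of the relationships from the Hu, Koren &
Volinsky paper that the reply describes, in the paper's notation:

  p_ui = 1 if r_ui > 0, else 0     -- the binary preference that gets factored
  c_ui = 1 + alpha * r_ui          -- the confidence attached to that preference
  prediction = x_u . y_i ~ p_ui    -- dot product of the learned factors

so the value spark.ml returns approximates the 0/1 preference, with confidence
only weighting the training loss; it is neither a preference nor a confidence.)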

On Wed, May 25, 2016 at 11:50 AM, edezhath  wrote:
> The original paper that the implicit preferences version of ALS is based on,
> mentions a "preference" and "confidence" for each user-item pair. But
> spark.ml.recommender.ALS only outputs a "prediction". How is this related to
> preference and confidence?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Preference-and-confidence-in-ALS-implicit-preferences-output-tp27023.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
oh yes, this was by accident, it should have gone to dev

On Wed, May 25, 2016 at 4:20 PM, Reynold Xin  wrote:

> Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533
>
> @Koert - Please keep API feedback coming. One thing - in the future, can
> you send api feedbacks to the dev@ list instead of user@?
>
>
>
> On Wed, May 25, 2016 at 1:05 PM, Cheng Lian  wrote:
>
>> Agree, since they can be easily replaced by .flatMap (to do explosion)
>> and .select (to rename output columns)
>>
>> Cheng
>>
>>
>> On 5/25/16 12:30 PM, Reynold Xin wrote:
>>
>> Based on this discussion I'm thinking we should deprecate the two explode
>> functions.
>>
>> On Wednesday, May 25, 2016, Koert Kuipers < 
>> ko...@tresata.com> wrote:
>>
>>> wenchen,
>>> that definition of explode seems identical to flatMap, so you dont need
>>> it either?
>>>
>>> michael,
>>> i didn't know about the column expression version of explode, that makes
>>> sense. i will experiment with that instead.
>>>
>>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan 
>>> wrote:
>>>
 I think we only need this version:  `def explode[B : Encoder](f: A
 => TraversableOnce[B]): Dataset[B]`

 For untyped one, `df.select(explode($"arrayCol").as("item"))` should be
 the best choice.

 On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <
 mich...@databricks.com> wrote:

> These APIs predate Datasets / encoders, so that is why they are Row
> instead of objects.  We should probably rethink that.
>
> Honestly, I usually end up using the column expression version of
> explode now that it exists (i.e. explode($"arrayCol").as("Item")).
> It would be great to understand more why you are using these instead.
>
> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers 
> wrote:
>
>> we currently have 2 explode definitions in Dataset:
>>
>>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
>> TraversableOnce[A]): DataFrame
>>
>>  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
>> String)(f: A => TraversableOnce[B]): DataFrame
>>
>> 1) the separation of the functions into their own argument lists is
>> nice, but unfortunately scala's type inference doesn't handle this well,
>> meaning that the generic types always have to be explicitly provided. i
>> assume this was done to allow the "input" to be a varargs in the first
>> method, and then kept the same in the second for reasons of symmetry.
>>
>> 2) i am surprised the first definition returns a DataFrame. this
>> seems to suggest DataFrame usage (so DataFrame to DataFrame), but there 
>> is
>> no way to specify the output column names, which limits its usability for
>> DataFrames. i frequently end up using the first definition for DataFrames
>> anyhow because of the need to return more than 1 column (and the data has
>> columns unknown at compile time that i need to carry along making flatMap
>> on Dataset clumsy/unusable), but relying on the output columns being 
>> called
>> _1 and _2 and renaming then afterwards seems like an anti-pattern.
>>
>> 3) using Row objects isn't very pretty. why not f: A =>
>> TraversableOnce[B] or something like that for the first definition? how
>> about:
>>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
>> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>
>> best,
>> koert
>>
>
>

>>>
>>
>


Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533

@Koert - Please keep API feedback coming. One thing - in the future, can
you send api feedbacks to the dev@ list instead of user@?



On Wed, May 25, 2016 at 1:05 PM, Cheng Lian  wrote:

> Agree, since they can be easily replaced by .flatMap (to do explosion) and
> .select (to rename output columns)
>
> Cheng
>
>
> On 5/25/16 12:30 PM, Reynold Xin wrote:
>
> Based on this discussion I'm thinking we should deprecate the two explode
> functions.
>
> On Wednesday, May 25, 2016, Koert Kuipers < 
> ko...@tresata.com> wrote:
>
>> wenchen,
>> that definition of explode seems identical to flatMap, so you dont need
>> it either?
>>
>> michael,
>> i didn't know about the column expression version of explode, that makes
>> sense. i will experiment with that instead.
>>
>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan 
>> wrote:
>>
>>> I think we only need this version:  `def explode[B : Encoder](f: A
>>> => TraversableOnce[B]): Dataset[B]`
>>>
>>> For untyped one, `df.select(explode($"arrayCol").as("item"))` should be
>>> the best choice.
>>>
>>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 These APIs predate Datasets / encoders, so that is why they are Row
 instead of objects.  We should probably rethink that.

 Honestly, I usually end up using the column expression version of
 explode now that it exists (i.e. explode($"arrayCol").as("Item")).  It
 would be great to understand more why you are using these instead.

 On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers 
 wrote:

> we currently have 2 explode definitions in Dataset:
>
>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
> TraversableOnce[A]): DataFrame
>
>  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
> String)(f: A => TraversableOnce[B]): DataFrame
>
> 1) the separation of the functions into their own argument lists is
> nice, but unfortunately scala's type inference doesn't handle this well,
> meaning that the generic types always have to be explicitly provided. i
> assume this was done to allow the "input" to be a varargs in the first
> method, and then kept the same in the second for reasons of symmetry.
>
> 2) i am surprised the first definition returns a DataFrame. this seems
> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no 
> way
> to specify the output column names, which limits its usability for
> DataFrames. i frequently end up using the first definition for DataFrames
> anyhow because of the need to return more than 1 column (and the data has
> columns unknown at compile time that i need to carry along making flatMap
> on Dataset clumsy/unusable), but relying on the output columns being 
> called
> _1 and _2 and renaming then afterwards seems like an anti-pattern.
>
> 3) using Row objects isn't very pretty. why not f: A =>
> TraversableOnce[B] or something like that for the first definition? how
> about:
>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>
> best,
> koert
>


>>>
>>
>


Re: Pros and Cons

2016-05-25 Thread Reynold Xin
On Wed, May 25, 2016 at 9:52 AM, Jörn Franke  wrote:

> Spark is more for machine learning working iteravely over the whole same
> dataset in memory. Additionally it has streaming and graph processing
> capabilities that can be used together.
>

Hi Jörn,

The first part is actually not true. Spark can handle data far greater than
the aggregate memory available on a cluster. The more recent versions
(1.3+) of Spark have external operations for almost all built-in operators,
and while things may not be perfect, those external operators are becoming
more and more robust with each version of Spark.


unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Koert Kuipers
hello all,

i have a single udf that creates 2 outputs (so a tuple 2). i would like to
add these 2 columns to my dataframe.

my current solution is along these lines:
df
  .withColumn("_temp_", udf(inputColumns))
  .withColumn("x", col("_temp_)("_1"))
  .withColumn("y", col("_temp_")("_2"))
  .drop("_temp_")

this works, but it's not pretty with the temporary field stuff.

i also tried this:
val tmp = udf(inputColumns)
df
  .withColumn("x", tmp("_1"))
  .withColumn("y", tmp("_2"))

this also works, but unfortunately the udf is evaluated twice

is there a better way to do this?

thanks! koert


Re: feedback on dataset api explode

2016-05-25 Thread Cheng Lian
Agree, since they can be easily replaced by .flatMap (to do explosion) 
and .select (to rename output columns)


Cheng

On 5/25/16 12:30 PM, Reynold Xin wrote:
Based on this discussion I'm thinking we should deprecate the two 
explode functions.


On Wednesday, May 25, 2016, Koert Kuipers > wrote:


wenchen,
that definition of explode seems identical to flatMap, so you dont
need it either?

michael,
i didn't know about the column expression version of explode, that
makes sense. i will experiment with that instead.

On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan
> wrote:

I think we only need this version:  `def explode[B :
Encoder](f: A => TraversableOnce[B]): Dataset[B]`

For untyped one, `df.select(explode($"arrayCol").as("item"))`
should be the best choice.

On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust
> wrote:

These APIs predate Datasets / encoders, so that is why
they are Row instead of objects.  We should probably
rethink that.

Honestly, I usually end up using the column expression
version of explode now that it exists (i.e.
explode($"arrayCol").as("Item")). It would be great to
understand more why you are using these instead.

On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers
> wrote:

we currently have 2 explode definitions in Dataset:

 def explode[A <: Product : TypeTag](input:
Column*)(f: Row => TraversableOnce[A]): DataFrame

 def explode[A, B : TypeTag](inputColumn: String,
outputColumn: String)(f: A => TraversableOnce[B]):
DataFrame

1) the separation of the functions into their own
argument lists is nice, but unfortunately scala's type
inference doesn't handle this well, meaning that the
generic types always have to be explicitly provided. i
assume this was done to allow the "input" to be a
varargs in the first method, and then kept the same in
the second for reasons of symmetry.

2) i am surprised the first definition returns a
DataFrame. this seems to suggest DataFrame usage (so
DataFrame to DataFrame), but there is no way to
specify the output column names, which limits its
usability for DataFrames. i frequently end up using
the first definition for DataFrames anyhow because of
the need to return more than 1 column (and the data
has columns unknown at compile time that i need to
carry along making flatMap on Dataset
clumsy/unusable), but relying on the output columns
being called _1 and _2 and renaming then afterwards
seems like an anti-pattern.

3) using Row objects isn't very pretty. why not f: A
=> TraversableOnce[B] or something like that for the
first definition? how about:
 def explode[A: TypeTag, B: TypeTag](input:
Seq[Column], output: Seq[Column])(f: A =>
TraversableOnce[B]): DataFrame

best,
koert








Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
Based on this discussion I'm thinking we should deprecate the two explode
functions.

On Wednesday, May 25, 2016, Koert Kuipers  wrote:

> wenchen,
> that definition of explode seems identical to flatMap, so you dont need it
> either?
>
> michael,
> i didn't know about the column expression version of explode, that makes
> sense. i will experiment with that instead.
>
> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan  > wrote:
>
>> I think we only need this version:  `def explode[B : Encoder](f: A
>> => TraversableOnce[B]): Dataset[B]`
>>
>> For untyped one, `df.select(explode($"arrayCol").as("item"))` should be
>> the best choice.
>>
>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <
>> mich...@databricks.com
>> > wrote:
>>
>>> These APIs predate Datasets / encoders, so that is why they are Row
>>> instead of objects.  We should probably rethink that.
>>>
>>> Honestly, I usually end up using the column expression version of
>>> explode now that it exists (i.e. explode($"arrayCol").as("Item")).  It
>>> would be great to understand more why you are using these instead.
>>>
>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers >> > wrote:
>>>
 we currently have 2 explode definitions in Dataset:

  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
 TraversableOnce[A]): DataFrame

  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
 String)(f: A => TraversableOnce[B]): DataFrame

 1) the separation of the functions into their own argument lists is
 nice, but unfortunately scala's type inference doesn't handle this well,
 meaning that the generic types always have to be explicitly provided. i
 assume this was done to allow the "input" to be a varargs in the first
 method, and then kept the same in the second for reasons of symmetry.

 2) i am surprised the first definition returns a DataFrame. this seems
 to suggest DataFrame usage (so DataFrame to DataFrame), but there is no way
 to specify the output column names, which limits its usability for
 DataFrames. i frequently end up using the first definition for DataFrames
 anyhow because of the need to return more than 1 column (and the data has
 columns unknown at compile time that i need to carry along making flatMap
 on Dataset clumsy/unusable), but relying on the output columns being called
 _1 and _2 and renaming then afterwards seems like an anti-pattern.

 3) using Row objects isn't very pretty. why not f: A =>
 TraversableOnce[B] or something like that for the first definition? how
 about:
  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
 Seq[Column])(f: A => TraversableOnce[B]): DataFrame

 best,
 koert

>>>
>>>
>>
>


Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
wenchen,
that definition of explode seems identical to flatMap, so you dont need it
either?

michael,
i didn't know about the column expression version of explode, that makes
sense. i will experiment with that instead.

On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan  wrote:

> I think we only need this version:  `def explode[B : Encoder](f: A
> => TraversableOnce[B]): Dataset[B]`
>
> For untyped one, `df.select(explode($"arrayCol").as("item"))` should be
> the best choice.
>
> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust  > wrote:
>
>> These APIs predate Datasets / encoders, so that is why they are Row
>> instead of objects.  We should probably rethink that.
>>
>> Honestly, I usually end up using the column expression version of explode
>> now that it exists (i.e. explode($"arrayCol").as("Item")).  It would be
>> great to understand more why you are using these instead.
>>
>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers  wrote:
>>
>>> we currently have 2 explode definitions in Dataset:
>>>
>>>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
>>> TraversableOnce[A]): DataFrame
>>>
>>>  def explode[A, B : TypeTag](inputColumn: String, outputColumn:
>>> String)(f: A => TraversableOnce[B]): DataFrame
>>>
>>> 1) the separation of the functions into their own argument lists is
>>> nice, but unfortunately scala's type inference doesn't handle this well,
>>> meaning that the generic types always have to be explicitly provided. i
>>> assume this was done to allow the "input" to be a varargs in the first
>>> method, and then kept the same in the second for reasons of symmetry.
>>>
>>> 2) i am surprised the first definition returns a DataFrame. this seems
>>> to suggest DataFrame usage (so DataFrame to DataFrame), but there is no way
>>> to specify the output column names, which limits its usability for
>>> DataFrames. i frequently end up using the first definition for DataFrames
>>> anyhow because of the need to return more than 1 column (and the data has
>>> columns unknown at compile time that i need to carry along making flatMap
>>> on Dataset clumsy/unusable), but relying on the output columns being called
>>> _1 and _2 and renaming then afterwards seems like an anti-pattern.
>>>
>>> 3) using Row objects isn't very pretty. why not f: A =>
>>> TraversableOnce[B] or something like that for the first definition? how
>>> about:
>>>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
>>> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>>
>>> best,
>>> koert
>>>
>>
>>
>


Re: feedback on dataset api explode

2016-05-25 Thread Michael Armbrust
These APIs predate Datasets / encoders, so that is why they are Row instead
of objects.  We should probably rethink that.

Honestly, I usually end up using the column expression version of explode
now that it exists (i.e. explode($"arrayCol").as("Item")).  It would be
great to understand more why you are using these instead.
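(Editorial sketch of the two alternatives being discussed, assuming the
spark-shell implicits are in scope.)

import org.apache.spark.sql.functions.explode

// column-expression version: one output row per array element
val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "arrayCol")
val exploded = df.select($"id", explode($"arrayCol").as("item"))

// typed alternative: flatMap on a Dataset plays the same role
val ds = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDS()
val flat = ds.flatMap { case (id, items) => items.map(item => (id, item)) }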

On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers  wrote:

> we currently have 2 explode definitions in Dataset:
>
>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
> TraversableOnce[A]): DataFrame
>
>  def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f:
> A => TraversableOnce[B]): DataFrame
>
> 1) the separation of the functions into their own argument lists is nice,
> but unfortunately scala's type inference doesn't handle this well, meaning
> that the generic types always have to be explicitly provided. i assume this
> was done to allow the "input" to be a varargs in the first method, and then
> kept the same in the second for reasons of symmetry.
>
> 2) i am surprised the first definition returns a DataFrame. this seems to
> suggest DataFrame usage (so DataFrame to DataFrame), but there is no way to
> specify the output column names, which limits its usability for DataFrames.
> i frequently end up using the first definition for DataFrames anyhow
> because of the need to return more than 1 column (and the data has columns
> unknown at compile time that i need to carry along making flatMap on
> Dataset clumsy/unusable), but relying on the output columns being called _1
> and _2 and renaming then afterwards seems like an anti-pattern.
>
> 3) using Row objects isn't very pretty. why not f: A => TraversableOnce[B]
> or something like that for the first definition? how about:
>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>
> best,
> koert
>


Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Ah right i see.

Thank you very much.
On May 25, 2016 11:11 AM, "Cody Koeninger"  wrote:

> There's an overloaded createDirectStream method that takes a map from
> topicpartition to offset for the starting point of the stream.
>
> On Wed, May 25, 2016 at 9:59 AM, trung kien  wrote:
> > Thank Cody.
> >
> > I can build the mapping from time ->offset. However how can i pass this
> > offset to Spark Streaming job using that offset? ( using Direct Approach)
> >
> > On May 25, 2016 9:42 AM, "Cody Koeninger"  wrote:
> >>
> >> Kafka does not yet have meaningful time indexing, there's a kafka
> >> improvement proposal for it but it has gotten pushed back to at least
> >> 0.10.1
> >>
> >> If you want to do this kind of thing, you will need to maintain your
> >> own index from time to offset.
> >>
> >> On Wed, May 25, 2016 at 8:15 AM, trung kien  wrote:
> >> > Hi all,
> >> >
> >> > Is there any way to re-compute using Spark Streaming - Kafka Direct
> >> > Approach
> >> > from specific time?
> >> >
> >> > In some cases, I want to re-compute again from specific time (e.g
> >> > beginning
> >> > of day)? is that possible?
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks
> >> > Kien
>


Re: Not able to write output to local filsystem from Standalone mode.

2016-05-25 Thread Mathieu Longtin
Experience. I don't use Mesos or Yarn or Hadoop, so I don't know.


On Wed, May 25, 2016 at 2:51 AM Jacek Laskowski  wrote:

> Hi Mathieu,
>
> Thanks a lot for the answer! I did *not* know it's the driver to
> create the directory.
>
> You said "standalone mode", is this the case for the other modes -
> yarn and mesos?
>
> p.s. Did you find it in the code or...just experienced before? #curious
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, May 24, 2016 at 4:04 PM, Mathieu Longtin 
> wrote:
> > In standalone mode, executor assume they have access to a shared file
> > system. The driver creates the directory and the executor write files, so
> > the executors end up not writing anything since there is no local
> directory.
> >
> > On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi 
> wrote:
> >>
> >> hi Jacek,
> >>
> >> Parent directory already present, its my home directory. Im using Linux
> >> (Redhat) machine 64 bit.
> >> Also I noticed that "test1" folder is created in my master with
> >> subdirectory as "_temporary" which is empty. but on slaves, no such
> >> directory is created under /home/stuti.
> >>
> >> Thanks
> >> Stuti
> >> 
> >> From: Jacek Laskowski [ja...@japila.pl]
> >> Sent: Tuesday, May 24, 2016 5:27 PM
> >> To: Stuti Awasthi
> >> Cc: user
> >> Subject: Re: Not able to write output to local filsystem from Standalone
> >> mode.
> >>
> >> Hi,
> >>
> >> What happens when you create the parent directory /home/stuti? I think
> the
> >> failure is due to missing parent directories. What's the OS?
> >>
> >> Jacek
> >>
> >> On 24 May 2016 11:27 a.m., "Stuti Awasthi" 
> wrote:
> >>
> >> Hi All,
> >>
> >> I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2
> >> Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch
> >> shell , read the input file from local filesystem and perform
> transformation
> >> successfully. When I try to write my output in local filesystem path
> then I
> >> receive below error .
> >>
> >>
> >>
> >> I tried to search on web and found similar Jira :
> >> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows
> >> resolved for Spark 1.3+ but already people have posted the same issue
> still
> >> persists in latest versions.
> >>
> >>
> >>
> >> ERROR
> >>
> >> scala> data.saveAsTextFile("/home/stuti/test1")
> >>
> >> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID
> 2,
> >> server1): java.io.IOException: The temporary job-output directory
> >> file:/home/stuti/test1/_temporary doesn't exist!
> >>
> >> at
> >>
> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
> >>
> >> at
> >>
> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
> >>
> >> at
> >>
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
> >>
> >> at
> >> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
> >>
> >> at
> >>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
> >>
> >> at
> >>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
> >>
> >> at
> >> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> >>
> >> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> >>
> >> at
> >> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> >>
> >> at
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>
> >> at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >>
> >>
> >> What is the best way to resolve this issue if suppose I don’t want to
> have
> >> Hadoop installed OR is it mandatory to have Hadoop to write the output
> from
> >> Standalone cluster mode.
> >>
> >>
> >>
> >> Please suggest.
> >>
> >>
> >>
> >> Thanks 
> >>
> >> Stuti Awasthi
> >>
> >>
> >>
> >>
> >>
> >> ::DISCLAIMER::
> >>
> >>
> 
> >>
> >> The contents of this e-mail and any attachment(s) are confidential and
> >> intended for the named recipient(s) only.
> >> E-mail transmission is not guaranteed to be secure or error-free as
> >> information could be intercepted, corrupted,
> >> lost, destroyed, arrive late or incomplete, or may contain viruses in
> >> transmission. The e mail and its contents
> >> (with or without referred errors) shall 

The 7th and Largest Spark Summit is less than 2 weeks away!

2016-05-25 Thread Scott walent
With every Spark Summit, an Apache Spark Community event, increasing
numbers of users and developers attend. This is the seventh Summit, and
whether you believe that “Seven” is the world’s most popular number, we are
offering a special promo code for all Apache Spark users and developers on
this list: SparkSummit7


Register (http://spark-summit.org/2016)
between now and May 31st with this promo code and get 25% off. Valid only
for new registrations.


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Sonal Goyal
You can look at ways to group records from both RDDs together instead of doing
a Cartesian product. For example, generate a pair RDD from each with the first
letter as the key, then partition and join.
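(A minimal editorial sketch of that idea, using the example strings from later
in this thread; fuzzyScore stands in for whatever matching function you already
have.)

val a = sc.parallelize(Seq("hi", "bye", "ch"))
val b = sc.parallelize(Seq("padma", "hihi", "chch", "priya"))

// block on a cheap key (here the first letter) so only plausible pairs meet
val aByKey = a.keyBy(_.head)
val bByKey = b.keyBy(_.head)

val candidates = aByKey.join(bByKey).values      // (stringFromA, stringFromB)
// val matches = candidates.filter { case (x, y) => fuzzyScore(x, y) > 0.85 }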
On May 25, 2016 8:04 PM, "Priya Ch"  wrote:

> Hi,
>   RDD A is of size 30MB and RDD B is of size 8 MB. Upon matching, we would
> like to filter out the strings that have greater than 85% match and
> generate a score for it which is used in the susbsequent calculations.
>
> I tried generating pair rdd from both rdds A and B with same key for all
> the records. Now performing A.join(B) is also resulting in huge execution
> time..
>
> How do I go about with map and reduce here ? To generate pairs from 2 rdds
> I dont think map can be used because we cannot have rdd inside another rdd.
>
> Would be glad if you can throw me some light on this.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 7:39 PM, Jörn Franke  wrote:
>
>> Solr or Elastic search provide much more functionality and are faster in
>> this context. The decision for or against them depends on your current and
>> future use cases. Your current use case is still very abstract so in order
>> to get a more proper recommendation you need to provide more details
>> including size of dataset, what you do with the result of the matching do
>> you just need the match number or also the pairs in the results etc.
>>
>> Your concrete problem can also be solved in Spark (though it is not the
>> best and most efficient tool for this, but it has other strength) using the
>> map reduce steps. There are different ways to implement this (Generate
>> pairs from the input datasets in the map step or (maybe less recommendable)
>> broadcast the smaller dataset to all nodes and do the matching with the
>> bigger dataset there.
>> This highly depends on the data in your data set. How they compare in
>> size etc.
>>
>>
>>
>> On 25 May 2016, at 13:27, Priya Ch  wrote:
>>
>> Why do i need to deploy solr for text anaytics...i have files placed in
>> HDFS. just need to look for matches against each string in both files and
>> generate those records whose match is > 85%. We trying to Fuzzy match
>> logic.
>>
>> How can use map/reduce operations across 2 rdds ?
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke 
>> wrote:
>>
>>>
>>> Alternatively depending on the exact use case you may employ solr on
>>> Hadoop for text analytics
>>>
>>> > On 25 May 2016, at 12:57, Priya Ch 
>>> wrote:
>>> >
>>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD
>>> B of
>>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i
>>> need
>>> > to check the matches found in rdd B as such for string "hi" i have to
>>> check
>>> > the matches against all strings in RDD B which means I need generate
>>> every
>>> > possible combination r
>>>
>>
>>
>


GraphFrame graph partitioning

2016-05-25 Thread rohit13k
How do I do graph partitioning in GraphFrames, similar to the partitionBy feature
in GraphX? Can we use the DataFrame repartition feature in 1.6 to provide
graph partitioning in GraphFrames?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphFrame-graph-partitioning-tp27024.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pros and Cons

2016-05-25 Thread Jörn Franke

Hive puts a little more emphasis on the case where the data being queried is 
much bigger than the available memory, where you need to query many different 
small data subsets, or, more recently, interactive queries (LLAP etc.). 

Spark is more for machine learning, working iteratively over the same whole 
dataset in memory. Additionally it has streaming and graph processing 
capabilities that can be used together. 

Besides this, depending on your needs, other ecosystem components are relevant. 
For instance, both are less good at lookups of single entries in a dataset, 
and neither is particularly good for text analytics.

That said, both develop rapidly and this may change. Additionally, both have 
alternatives, such as Flink for Spark, etc.



Sent from my iPhone
> On 25 May 2016, at 18:11, Mich Talebzadeh  wrote:
> 
> Can you be a bit more specific how are you going to use Spark. For example as 
> a powerful query tool, Analytics, Data migration.
> 
> Spark SQL and Spark-shell provide a subset of Hive SQL (depending on which 
> version of Hive and Spark you have in mind).
> 
> As a query tool Spark is very powerful as it uses DAG and In-memory 
> computation, provides Scala (and others) as the language. You can create your 
> own uber JAR fie for distribution etc
> You can of course use Spark as an execution engine for Hive as opposed to 
> map-reduce to take advantage of Spark processing
> 
> etc etc
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 25 May 2016 at 16:34, Aakash Basu  wrote:
>> Hi,
>> 
>>  
>> 
>> I’m new to the Spark Ecosystem, need to understand the Pros and Cons of 
>> fetching data using SparkSQL vs Hive in Spark vs Spark API.
>> 
>>  
>> 
>> PLEASE HELP!
>> 
>>  
>> 
>> Thanks,
>> 
>> Aakash Basu.
>> 
> 


Preference and confidence in ALS implicit preferences output?

2016-05-25 Thread edezhath
The original paper that the implicit preferences version of ALS is based on,
mentions a "preference" and "confidence" for each user-item pair. But
spark.ml.recommender.ALS only outputs a "prediction". How is this related to
preference and confidence?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Preference-and-confidence-in-ALS-implicit-preferences-output-tp27023.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



sparkApp on standalone/local mode with multithreading

2016-05-25 Thread sujeet jog
I had a few questions w.r.t. Spark deployment and the way I want to use it.
It would be helpful if you could answer a few.

I plan to use Spark on an embedded switch, which has a limited set of
resources, say 1 or 2 dedicated cores and 1.5 GB of memory. I want to model
network traffic with time series algorithms. The algorithms I want to use do
not currently exist in Spark, so I'm writing them in R, and I plan to use
Pipe to get them executed from Spark.

The reason I'm using Spark, other than the ETL functions, is portability, so
that the same code can be reused on an x86 platform with more CPU and memory
resources if required.

Within the same application I would like to create multiple threads: one
thread doing the testing of ML, another doing training at some specific
time, and perhaps some other thread for any other activity if required.

Can you please let me know if you see any apparent issues with this kind of
design from your experience with Spark.
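
For what it's worth, here is a rough, untested sketch of what I have in mind
(the data and script paths are placeholders; the point is RDD.pipe for the R
part, plus the fact that the Spark scheduler is thread-safe, so several jobs
can be submitted concurrently from one SparkContext):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("ts-model").setMaster("local[2]"))

// feed each partition of the time series to an external R script via RDD.pipe
val series = sc.textFile("hdfs:///data/traffic.csv")
val forecasts = series.pipe("Rscript /opt/models/forecast.R") // placeholder script

// testing and training can run from separate threads sharing the same context
val scoring = new Thread(new Runnable {
  def run(): Unit = forecasts.saveAsTextFile("hdfs:///out/forecasts")
})
val training = new Thread(new Runnable {
  def run(): Unit = println("rows available for training: " + series.count())
})
scoring.start(); training.start()
scoring.join(); training.join()

sc.stop()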


Re: Accumulators displayed in SparkUI in 1.4.1?

2016-05-25 Thread Jacek Laskowski
On 25 May 2016 6:00 p.m., "Daniel Barclay" 
wrote:
>
> Was the feature of displaying accumulators in the Spark UI implemented in
Spark 1.4.1, or was that added later?

Dunno, but only *named* *accumulators* are displayed in Spark’s webUI
(under Stages tab for a given stage).
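
For illustration, a minimal sketch (Spark 1.x API; the log path is just a
placeholder). It is the name argument that makes the accumulator show up there:

val errorCount = sc.accumulator(0L, "errorCount") // named => visible in the UI

sc.textFile("hdfs:///logs/app.log").foreach { line =>
  if (line.contains("ERROR")) errorCount += 1L
}
println(errorCount.value)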

Jacek


Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
There's an overloaded createDirectStream method that takes a map from
topicpartition to offset for the starting point of the stream.
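
For illustration, a rough sketch of that overload (spark-streaming-kafka for
Kafka 0.8 in Spark 1.x; the streaming context ssc, the topic name and the
starting offsets are placeholders you would fill from your own time -> offset
index):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// per-partition starting offsets, looked up from your own index
val fromOffsets = Map(
  TopicAndPartition("ratings", 0) -> 42000L,
  TopicAndPartition("ratings", 1) -> 41750L
)

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))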

On Wed, May 25, 2016 at 9:59 AM, trung kien  wrote:
> Thank Cody.
>
> I can build the mapping from time ->offset. However how can i pass this
> offset to Spark Streaming job using that offset? ( using Direct Approach)
>
> On May 25, 2016 9:42 AM, "Cody Koeninger"  wrote:
>>
>> Kafka does not yet have meaningful time indexing, there's a kafka
>> improvement proposal for it but it has gotten pushed back to at least
>> 0.10.1
>>
>> If you want to do this kind of thing, you will need to maintain your
>> own index from time to offset.
>>
>> On Wed, May 25, 2016 at 8:15 AM, trung kien  wrote:
>> > Hi all,
>> >
>> > Is there any way to re-compute using Spark Streaming - Kafka Direct
>> > Approach
>> > from specific time?
>> >
>> > In some cases, I want to re-compute again from specific time (e.g
>> > beginning
>> > of day)? is that possible?
>> >
>> >
>> >
>> > --
>> > Thanks
>> > Kien

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pros and Cons

2016-05-25 Thread Mich Talebzadeh
Can you be a bit more specific about how you are going to use Spark? For
example, as a powerful query tool, analytics, or data migration.

Spark SQL and Spark-shell provide a subset of Hive SQL (depending on which
version of Hive and Spark you have in mind).

As a query tool Spark is very powerful, as it uses a DAG and in-memory
computation and provides Scala (and others) as the language. You can create
your own uber JAR file for distribution etc.
You can of course use Spark as an execution engine for Hive, as opposed to
map-reduce, to take advantage of Spark processing.

etc etc

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 25 May 2016 at 16:34, Aakash Basu  wrote:

> Hi,
>
>
>
> I’m new to the Spark Ecosystem, need to understand the *Pros and Cons *of
> fetching data using *SparkSQL vs Hive in Spark vs Spark API.*
>
>
>
> *PLEASE HELP!*
>
>
>
> Thanks,
>
> Aakash Basu.
>


Accumulators displayed in SparkUI in 1.4.1?

2016-05-25 Thread Daniel Barclay

Was the feature of displaying accumulators in the Spark UI implemented in Spark 
1.4.1, or was that added later?

Thanks,
Daniel




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
we currently have 2 explode definitions in Dataset:

 def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
TraversableOnce[A]): DataFrame

 def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f:
A => TraversableOnce[B]): DataFrame

1) The separation of the functions into their own argument lists is nice,
but unfortunately Scala's type inference doesn't handle this well, meaning
that the generic types always have to be explicitly provided. I assume this
was done to allow the "input" to be varargs in the first method, and then
kept the same in the second for reasons of symmetry.

2) I am surprised the first definition returns a DataFrame. This seems to
suggest DataFrame usage (so DataFrame to DataFrame), but there is no way to
specify the output column names, which limits its usability for DataFrames.
I frequently end up using the first definition for DataFrames anyhow
because of the need to return more than 1 column (and the data has columns
unknown at compile time that I need to carry along, making flatMap on
Dataset clumsy/unusable), but relying on the output columns being called _1
and _2 and renaming them afterwards seems like an anti-pattern (see the
sketch below).

3) Using Row objects isn't very pretty. Why not f: A => TraversableOnce[B]
or something like that for the first definition? How about:
 def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
Seq[Column])(f: A => TraversableOnce[B]): DataFrame
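
For illustration, a minimal, untested sketch of what I mean on a toy
DataFrame (Spark 1.6 API; sqlContext and the column names are just
placeholders):

import org.apache.spark.sql.Row
import sqlContext.implicits._

val df = sqlContext.createDataFrame(Seq(("a", "x,y"), ("b", "z"))).toDF("id", "csv")

// second definition: explicit input/output column names
val byName = df.explode("csv", "token") { s: String => s.split(",").toSeq }

// first definition: type parameter given explicitly, output columns come
// back as _1 and _2 and then have to be renamed
val byColumn = df
  .explode[(String, Int)]($"csv") { row: Row =>
    row.getString(0).split(",").map(t => (t, t.length)).toSeq
  }
  .withColumnRenamed("_1", "token")
  .withColumnRenamed("_2", "len")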

best,
koert


Pros and Cons

2016-05-25 Thread Aakash Basu
Hi,



I’m new to the Spark Ecosystem, need to understand the *Pros and Cons *of
fetching data using *SparkSQL vs Hive in Spark vs Spark API.*



*PLEASE HELP!*



Thanks,

Aakash Basu.


Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso Isidoro Roman
Hi Matthias and Cody, thanks for the answer. This is the code that is
raising the runtime exception:

try{
messages.foreachRDD( rdd =>{
  val count = rdd.count()
  if (count > 0){
//someMessages should be AmazonRating...
val someMessages = rdd.take(count.toInt)
println("<-->")
println("someMessages is " + someMessages)
someMessages.foreach(println)
println("<-->")
println("<---POSSIBLE SOLUTION--->")
messages
.map { case (_, jsonRating) =>
  val jsValue = Json.parse(jsonRating)
  AmazonRating.amazonRatingFormat.reads(jsValue) match {
case JsSuccess(rating, _) => rating
case JsError(_) => AmazonRating.empty
  }
 }
.filter(_ != AmazonRating.empty)
*//this line raises the runtime error, but if i comment it another
different runtime exception happens!*
.foreachRDD(_.foreachPartition(it =>
recommender.predictWithALS(it.toSeq)))
println("<---POSSIBLE SOLUTION--->")
  }
  }
)
}catch{
  case e: IllegalArgumentException => {println("illegal arg.
exception")};
  case e: IllegalStateException=> {println("illegal state
exception")};
  case e: ClassCastException   => {println("ClassCastException")};
  case e: Exception=> {println(" Generic Exception")};
}finally{

  println("Finished taking data from kafka topic...")
}

If I comment out the line with the second foreachRDD, the next runtime
exception happens on a fresh start, i.e. when the Kafka producer pushes
data into the topic:

16/05/25 17:26:12 ERROR JobScheduler: Error running job streaming job
1464189972000 ms.0

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
0.0 (TID 0, localhost): java.lang.ClassCastException:
org.apache.spark.util.SerializableConfiguration cannot be cast to [B

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

at org.apache.spark.scheduler.Task.run(Task.scala:89)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

If I push another JSON record into the topic, the next exception happens:

16/05/25 17:27:16 INFO DAGScheduler: Job 1 finished: runJob at
KafkaRDD.scala:98, took 0,039689 s

<-->

someMessages is [Lscala.Tuple2;@712ca120

(null,{"userId":"someUserId","productId":"0981531679","rating":9.0})

<-->

<---POSSIBLE SOLUTION--->

16/05/25 17:27:16 INFO JobScheduler: Finished job streaming job
1464190036000 ms.0 from job set of time 1464190036000 ms

16/05/25 17:27:16 INFO JobScheduler: Total delay: 0,063 s for time
1464190036000 ms (execution: 0,055 s)

16/05/25 17:27:16 INFO KafkaRDD: Removing RDD 43 from persistence list

16/05/25 17:27:16 ERROR JobScheduler: Error running job streaming job
1464190036000 ms.0

java.lang.IllegalStateException: Adding new inputs, transformations, and
output operations after starting a context is not supported

at
org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:222)

at org.apache.spark.streaming.dstream.DStream.(DStream.scala:64)

at
org.apache.spark.streaming.dstream.MappedDStream.(MappedDStream.scala:25)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)

at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)

at
org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:260)

at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:557)

at
example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:125)

at
example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:114)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)

at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)

at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)

at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)

at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)

at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)

at

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Thanks, Cody.

I can build the mapping from time -> offset. However, how can I pass this
offset to the Spark Streaming job (using the Direct Approach)?
On May 25, 2016 9:42 AM, "Cody Koeninger"  wrote:

> Kafka does not yet have meaningful time indexing, there's a kafka
> improvement proposal for it but it has gotten pushed back to at least
> 0.10.1
>
> If you want to do this kind of thing, you will need to maintain your
> own index from time to offset.
>
> On Wed, May 25, 2016 at 8:15 AM, trung kien  wrote:
> > Hi all,
> >
> > Is there any way to re-compute using Spark Streaming - Kafka Direct
> Approach
> > from specific time?
> >
> > In some cases, I want to re-compute again from specific time (e.g
> beginning
> > of day)? is that possible?
> >
> >
> >
> > --
> > Thanks
> > Kien
>


Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Matthias Niehoff
Hi,

you register some output actions (in this case foreachRDD) after starting
the streaming context. StreamingContext.start() has to be called after all
output actions have been defined.
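
For illustration, a minimal sketch of the required ordering, using the names
from your code (parseRating stands in for your Json.parse/reads block):

// 1) define ALL transformations and output actions first ...
val ratings = messages
  .map { case (_, jsonRating) => parseRating(jsonRating) }
  .filter(_ != AmazonRating.empty)

ratings.foreachRDD { rdd =>
  rdd.foreachPartition(it => recommender.predictWithALS(it.toSeq))
}

// 2) ... and only then start the context
ssc.start()
ssc.awaitTermination()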

2016-05-25 15:59 GMT+02:00 Alonso :

> Hi, i am receiving this exception when direct spark streaming process
> tries to pull data from kafka topic:
>
> 16/05/25 11:30:30 INFO CheckpointWriter: Checkpoint for time 146416863
> ms saved to file
> 'file:/Users/aironman/my-recommendation-spark-engine/checkpoint/checkpoint-146416863',
> took 5928 bytes and 8 ms
>
> 16/05/25 11:30:30 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1041 
> bytes result sent to driver
> 16/05/25 11:30:30 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) 
> in 4 ms on localhost (1/1)
> 16/05/25 11:30:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool
> 16/05/25 11:30:30 INFO DAGScheduler: ResultStage 2 (runJob at 
> KafkaRDD.scala:98) finished in 0,004 s
> 16/05/25 11:30:30 INFO DAGScheduler: Job 2 finished: runJob at 
> KafkaRDD.scala:98, took 0,008740 s
> <-->
> someMessages is [Lscala.Tuple2;@2641d687
> (null,{"userId":"someUserId","productId":"0981531679","rating":6.0})
> <-->
> <---POSSIBLE SOLUTION--->
> 16/05/25 11:30:30 INFO JobScheduler: Finished job streaming job 146416863 
> ms.0 from job set of time 146416863 ms
> 16/05/25 11:30:30 INFO KafkaRDD: Removing RDD 105 from persistence list
> 16/05/25 11:30:30 INFO JobScheduler: Total delay: 0,020 s for time 
> 146416863 ms (execution: 0,012 s)
> 16/05/25 11:30:30 ERROR JobScheduler: Error running job streaming job 
> 146416863 ms.0*java.lang.IllegalStateException: Adding new inputs, 
> transformations, and output operations after starting a context is not 
> supported
>   at* 
> org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:222)
>   at org.apache.spark.streaming.dstream.DStream.(DStream.scala:64)
>   at 
> org.apache.spark.streaming.dstream.MappedDStream.(MappedDStream.scala:25)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
>   at 
> org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:260)
>   at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:557)
>   at 
> example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:125)
>   at 
> example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:114)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
>   at scala.util.Try$.apply(Try.scala:161)
>   at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/05/25 11:30:30 INFO BlockManager: Removing RDD 105
>
>
> This is the code that rises the exception 

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
Kafka does not yet have meaningful time indexing; there's a Kafka
Improvement Proposal for it, but it has gotten pushed back to at least
0.10.1.

If you want to do this kind of thing, you will need to maintain your
own index from time to offset.
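
For illustration, a rough sketch of gathering the material for such an index
(assuming `stream` is the direct stream; here the ranges are only printed,
in practice you would persist them with a timestamp in a store of your
choice):

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // the cast works on the RDDs produced directly by the direct stream
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val batchTime = System.currentTimeMillis()
  ranges.foreach { r =>
    println(s"$batchTime ${r.topic} ${r.partition} ${r.fromOffset} ${r.untilOffset}")
  }
}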

On Wed, May 25, 2016 at 8:15 AM, trung kien  wrote:
> Hi all,
>
> Is there any way to re-compute using Spark Streaming - Kafka Direct Approach
> from specific time?
>
> In some cases, I want to re-compute again from specific time (e.g beginning
> of day)? is that possible?
>
>
>
> --
> Thanks
> Kien

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Cody Koeninger
I'd fix the kafka version on the executor classpath (should be
0.8.2.1) before trying anything else, even if it may be unrelated to
the actual error.  Definitely don't upgrade your brokers to 0.9

On Wed, May 25, 2016 at 2:30 AM, Scott W  wrote:
> I'm running into below error while trying to consume message from Kafka
> through Spark streaming (Kafka direct API). This used to work OK when using
> Spark standalone cluster manager. We're just switching to using Cloudera 5.7
> using Yarn to manage Spark cluster and started to see the below error.
>
> Few details:
> - Spark 1.6.0
> - Using Kafka direct stream API
> - Kafka broker version (0.8.2.1)
> - Kafka version in the classpath of Yarn executors (0.9)
> - Kafka brokers not managed by Cloudera
>
> The only difference I see between using standalone cluster manager and yarn
> is the Kafka version being used on the consumer end. (0.8.2.1 vs 0.9)
>
> Trying to figure if version mismatch is really an issue ? If indeed the
> case, what would be the fix for this other than upgrading Kafka brokers to
> 0.9 as well. (eventually yes but not for now) OR is there something else I'm
> missing here.
>
> Appreciate the help.
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 200.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 200.0 (TID 203,..): java.nio.BufferUnderflowException
> at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
> at java.nio.ByteBuffer.get(ByteBuffer.java:715)
> at kafka.api.ApiUtils$.readShortString(ApiUtils.scala:40)
> at kafka.api.TopicData$.readFrom(FetchResponse.scala:96)
> at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
> at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.Range.foreach(Range.scala:141)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:192)
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:942)
> at
> org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:942)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)


Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Cody Koeninger
Am I reading this correctly that you're calling messages.foreachRDD inside
of the messages.foreachRDD block?  Don't do that.
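
For illustration, a rough sketch of one way to restructure it (using the
names from your post; parseRating stands in for the Json.parse/reads block):
work on the RDD handed to the single foreachRDD instead of registering new
DStream operations inside it.

messages.foreachRDD { rdd =>
  val ratings = rdd
    .map { case (_, jsonRating) => parseRating(jsonRating) }
    .filter(_ != AmazonRating.empty)
  ratings.foreachPartition(it => recommender.predictWithALS(it.toSeq))
}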

On Wed, May 25, 2016 at 8:59 AM, Alonso  wrote:

> Hi, i am receiving this exception when direct spark streaming process
> tries to pull data from kafka topic:
>
> 16/05/25 11:30:30 INFO CheckpointWriter: Checkpoint for time 146416863
> ms saved to file
> 'file:/Users/aironman/my-recommendation-spark-engine/checkpoint/checkpoint-146416863',
> took 5928 bytes and 8 ms
>
> 16/05/25 11:30:30 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1041 
> bytes result sent to driver
> 16/05/25 11:30:30 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) 
> in 4 ms on localhost (1/1)
> 16/05/25 11:30:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool
> 16/05/25 11:30:30 INFO DAGScheduler: ResultStage 2 (runJob at 
> KafkaRDD.scala:98) finished in 0,004 s
> 16/05/25 11:30:30 INFO DAGScheduler: Job 2 finished: runJob at 
> KafkaRDD.scala:98, took 0,008740 s
> <-->
> someMessages is [Lscala.Tuple2;@2641d687
> (null,{"userId":"someUserId","productId":"0981531679","rating":6.0})
> <-->
> <---POSSIBLE SOLUTION--->
> 16/05/25 11:30:30 INFO JobScheduler: Finished job streaming job 146416863 
> ms.0 from job set of time 146416863 ms
> 16/05/25 11:30:30 INFO KafkaRDD: Removing RDD 105 from persistence list
> 16/05/25 11:30:30 INFO JobScheduler: Total delay: 0,020 s for time 
> 146416863 ms (execution: 0,012 s)
> 16/05/25 11:30:30 ERROR JobScheduler: Error running job streaming job 
> 146416863 ms.0*java.lang.IllegalStateException: Adding new inputs, 
> transformations, and output operations after starting a context is not 
> supported
>   at* 
> org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:222)
>   at org.apache.spark.streaming.dstream.DStream.(DStream.scala:64)
>   at 
> org.apache.spark.streaming.dstream.MappedDStream.(MappedDStream.scala:25)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
>   at 
> org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:260)
>   at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:557)
>   at 
> example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:125)
>   at 
> example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:114)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
>   at scala.util.Try$.apply(Try.scala:161)
>   at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/05/25 11:30:30 INFO BlockManager: Removing RDD 105
>
>
> This is the code that rises the exception in the spark streaming process:
>

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi,
  RDD A is of size 30 MB and RDD B is of size 8 MB. Upon matching, we would
like to filter for the strings that have a greater than 85% match and
generate a score for each, which is used in the subsequent calculations.

I tried generating pair RDDs from both RDDs A and B with the same key for
all the records. Now performing A.join(B) is also resulting in huge
execution time.

How do I go about using map and reduce here? To generate pairs from 2 RDDs,
I don't think map can be used, because we cannot have an RDD inside another RDD.

I would be glad if you could throw some light on this.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 7:39 PM, Jörn Franke  wrote:

> Solr or Elasticsearch provide much more functionality and are faster in
> this context. The decision for or against them depends on your current and
> future use cases. Your current use case is still very abstract, so in order
> to get a more proper recommendation you need to provide more details,
> including the size of the dataset and what you do with the result of the
> matching: do you just need the match count or also the pairs in the results, etc.
>
> Your concrete problem can also be solved in Spark (though it is not the
> best and most efficient tool for this, it has other strengths) using
> map/reduce steps. There are different ways to implement this: generate
> pairs from the input datasets in the map step, or (maybe less advisable)
> broadcast the smaller dataset to all nodes and do the matching with the
> bigger dataset there.
> This depends heavily on the data in your datasets and how they compare in
> size etc.
>
>
>
> On 25 May 2016, at 13:27, Priya Ch  wrote:
>
> Why do I need to deploy Solr for text analytics? I have files placed in
> HDFS; I just need to look for matches against each string in both files and
> generate those records whose match is > 85%. We are trying fuzzy match
> logic.
>
> How can I use map/reduce operations across 2 RDDs?
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke  wrote:
>
>>
>> Alternatively depending on the exact use case you may employ solr on
>> Hadoop for text analytics
>>
>> > On 25 May 2016, at 12:57, Priya Ch 
>> wrote:
>> >
>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD
>> B of
>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i
>> need
>> > to check the matches found in rdd B as such for string "hi" i have to
>> check
>> > the matches against all strings in RDD B which means I need generate
>> every
>> > possible combination r
>>
>
>


Re: job build cost more and more time

2016-05-25 Thread nguyen duc tuan
Take a look here:
http://stackoverflow.com/questions/33424445/is-there-a-way-to-checkpoint-apache-spark-dataframes
So all you have to do to create a checkpoint for a DataFrame is as follows:
df.rdd.checkpoint
df.rdd.count // or any action
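
A slightly fuller, untested sketch of the same idea (Spark 1.x; the
checkpoint directory is a placeholder):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val df = sqlContext.sql("SELECT a.* FROM income a")
val rows = df.rdd            // grab the underlying RDD once
rows.checkpoint()            // mark its lineage for truncation
rows.count()                 // an action to materialize the checkpoint

// if you need a DataFrame again, rebuild it from the checkpointed RDD
val checkpointedDf = sqlContext.createDataFrame(rows, df.schema)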

2016-05-25 8:43 GMT+07:00 naliazheli <754405...@qq.com>:

> I am using Spark 1.6 and noticed the time between jobs gets longer; sometimes
> it could be 20 mins.
> I tried to search for similar questions and found a close one:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-app-gets-slower-as-it-gets-executed-more-times-td1089.html#a1146
>
> and found something useful:
> One thing to worry about is long-running jobs or shells. Currently, state
> buildup of a single job in Spark is a problem, as certain state such as
> shuffle files and RDD metadata is not cleaned up until the job (or shell)
> exits. We have hacky ways to reduce this, and are working on a long term
> solution. However, separate, consecutive jobs should be independent in
> terms
> of performance.
>
>
> On Sat, Feb 1, 2014 at 8:27 PM, 尹绪森 <[hidden email]> wrote:
> Is your spark app an iterative one ? If so, your app is creating a big DAG
> in every iteration. You should use checkpoint it periodically, say, 10
> iterations one checkpoint.
>
> i also wrote a test program,there is the code:
>
> public static void newJob(int jobNum, SQLContext sqlContext){
> for(int i = 0; i < jobNum; i++){
> testJob(i, sqlContext);
> }
> }
>
>
> public static void testJob(int jobNum,SQLContext sqlContext){
> String test_sql =" SELECT a.*   FROM income a";
> DataFrame test_df = sqlContext.sql(test_sql);
> test_df.registerTempTable("income");
> test_df.cache();
> test_df.count();
> test_df.show();
> }
> }
>
> Calling newJob(100, sqlContext) reproduces my issue: the job build costs
> more and more time.
> DataFrame does not have an API like checkpoint.
> Is there another way to resolve it?
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/job-build-cost-more-and-more-time-tp27017.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Solr or Elasticsearch provide much more functionality and are faster in this 
context. The decision for or against them depends on your current and future 
use cases. Your current use case is still very abstract, so in order to get a 
more proper recommendation you need to provide more details, including the size 
of the dataset and what you do with the result of the matching: do you just need 
the match count or also the pairs in the results, etc.

Your concrete problem can also be solved in Spark (though it is not the best 
and most efficient tool for this, it has other strengths) using map/reduce 
steps. There are different ways to implement this: generate pairs from the 
input datasets in the map step, or (maybe less advisable) broadcast the smaller 
dataset to all nodes and do the matching with the bigger dataset there.
This depends heavily on the data in your datasets and how they compare in size etc.
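
For illustration, a rough, untested sketch of the broadcast variant in
spark-shell style (paths are placeholders, and the bigram Jaccard score below
is only a stand-in for whatever fuzzy-match metric you actually use):

val small = sc.textFile("hdfs:///path/to/smaller").collect() // ~8 MB, fits on the driver
val smallBc = sc.broadcast(small)

def bigrams(s: String): Set[String] = s.toLowerCase.sliding(2).toSet

// stand-in similarity measure; replace with your real fuzzy-match logic
def score(a: String, b: String): Double = {
  val (x, y) = (bigrams(a), bigrams(b))
  if (x.isEmpty && y.isEmpty) 1.0
  else x.intersect(y).size.toDouble / x.union(y).size
}

val matches = sc.textFile("hdfs:///path/to/bigger").flatMap { a =>
  smallBc.value.iterator
    .map(b => (a, b, score(a, b)))
    .filter(_._3 > 0.85)
}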



> On 25 May 2016, at 13:27, Priya Ch  wrote:
> 
> Why do I need to deploy Solr for text analytics? I have files placed in HDFS; 
> I just need to look for matches against each string in both files and generate 
> those records whose match is > 85%. We are trying fuzzy match logic. 
> 
> How can I use map/reduce operations across 2 RDDs?
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke  wrote:
>> 
>> Alternatively depending on the exact use case you may employ solr on Hadoop 
>> for text analytics
>> 
>> > On 25 May 2016, at 12:57, Priya Ch  wrote:
>> >
>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B of
>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
>> > to check the matches found in rdd B as such for string "hi" i have to check
>> > the matches against all strings in RDD B which means I need generate every
>> > possible combination r
> 


about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso
Hi, I am receiving this exception when the direct Spark Streaming process tries
to pull data from the Kafka topic:

16/05/25 11:30:30 INFO CheckpointWriter: Checkpoint for time 146416863
ms saved to file
'file:/Users/aironman/my-recommendation-spark-engine/checkpoint/checkpoint-146416863',
took 5928 bytes and 8 ms

16/05/25 11:30:30 INFO Executor: Finished task 0.0 in stage 2.0 (TID
2). 1041 bytes result sent to driver
16/05/25 11:30:30 INFO TaskSetManager: Finished task 0.0 in stage 2.0
(TID 2) in 4 ms on localhost (1/1)
16/05/25 11:30:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose
tasks have all completed, from pool
16/05/25 11:30:30 INFO DAGScheduler: ResultStage 2 (runJob at
KafkaRDD.scala:98) finished in 0,004 s
16/05/25 11:30:30 INFO DAGScheduler: Job 2 finished: runJob at
KafkaRDD.scala:98, took 0,008740 s
<-->
someMessages is [Lscala.Tuple2;@2641d687
(null,{"userId":"someUserId","productId":"0981531679","rating":6.0})
<-->
<---POSSIBLE SOLUTION--->
16/05/25 11:30:30 INFO JobScheduler: Finished job streaming job
146416863 ms.0 from job set of time 146416863 ms
16/05/25 11:30:30 INFO KafkaRDD: Removing RDD 105 from persistence list
16/05/25 11:30:30 INFO JobScheduler: Total delay: 0,020 s for time
146416863 ms (execution: 0,012 s)
16/05/25 11:30:30 ERROR JobScheduler: Error running job streaming job
146416863 ms.0*java.lang.IllegalStateException: Adding new inputs,
transformations, and output operations after starting a context is not
supported
at* 
org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:222)
at org.apache.spark.streaming.dstream.DStream.(DStream.scala:64)
at 
org.apache.spark.streaming.dstream.MappedDStream.(MappedDStream.scala:25)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:558)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at 
org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:260)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:557)
at 
example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:125)
at 
example.spark.AmazonKafkaConnector$$anonfun$main$1.apply(AmazonKafkaConnectorWithMongo.scala:114)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/05/25 11:30:30 INFO BlockManager: Removing RDD 105


This is the code that raises the exception in the Spark Streaming process:

try{
messages.foreachRDD( rdd =>{
  val count = rdd.count()
  if (count > 0){
//someMessages should be AmazonRating...
val someMessages = rdd.take(count.toInt)
println("<-->")
println("someMessages is " + someMessages)
someMessages.foreach(println)

Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Hi all,

Is there any way to re-compute using the Spark Streaming - Kafka Direct
Approach from a specific time?

In some cases, I want to re-compute again from a specific time (e.g. the
beginning of the day). Is that possible?



-- 
Thanks
Kien


StackOverflow in Spark

2016-05-25 Thread Michel Hubert

Hi,


I have a Spark application which generates StackOverflowError exceptions after 
30+ minutes.

Does anyone have any ideas?

It seems like a problem with deserialization of checkpoint data?





16/05/25 10:48:51 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 55449.0 
(TID 5584, host81440-cld.opentsp.com): java.lang.StackOverflowError
*at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
*at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
*at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
*at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
*at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
*at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
*at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
*at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
*at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
*at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
*at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
*at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
*at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
*at java.lang.reflect.Method.invoke(Method.java:606)
*at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
*at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
*at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
*at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
*at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
*at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
*at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
*at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
*at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
*at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
*at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
*at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
*at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
*at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)




Driver stacktrace:
*at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
*at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
*at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
*at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
*at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
*at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
*at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
*at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
*at scala.Option.foreach(Option.scala:236)
*at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
*at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
*at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
*at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
*at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
*at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
*at org.apache.spark.SparkContext.runJob(SparkContext.scala:1843)
*at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
*at org.apache.spark.SparkContext.runJob(SparkContext.scala:1933)
*at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:67)
*at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:54)
*at org.elasticsearch.spark.rdd.EsSpark$.saveJsonToEs(EsSpark.scala:90)
*at 
org.elasticsearch.spark.rdd.api.java.JavaEsSpark$.saveJsonToEs(JavaEsSpark.scala:62)
at 
org.elasticsearch.spark.rdd.api.java.JavaEsSpark.saveJsonToEs(JavaEsSpark.scala)



Re: Facing issues while reading parquet file in spark 1.2.1

2016-05-25 Thread Takeshi Yamamuro
Hi,

You need to describe more so that others can easily understand:
what version of Spark and what query are you using?

// maropu


On Wed, May 25, 2016 at 8:27 PM, vaibhav srivastava 
wrote:

> Hi All,
>
>  I am facing below stack traces while reading data from parquet file
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
>
> at parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:247)
>
> at
> parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:47)
>
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
>
> at
> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
>
> at
> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
>
> at
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
>
> at
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
>
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>
> at scala.Option.map(Option.scala:145)
>
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
>
> at
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
>
> at
> org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65)
>
> at
> org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165)
>
> Please suggest. It seems like it not able to convert some data
>



-- 
---
Takeshi Yamamuro


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Why do I need to deploy Solr for text analytics? I have files placed in
HDFS; I just need to look for matches against each string in both files and
generate those records whose match is > 85%. We are trying fuzzy match
logic.

How can I use map/reduce operations across 2 RDDs?

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:49 PM, Jörn Franke  wrote:

>
> Alternatively depending on the exact use case you may employ solr on
> Hadoop for text analytics
>
> > On 25 May 2016, at 12:57, Priya Ch  wrote:
> >
> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B
> of
> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
> > to check the matches found in rdd B as such for string "hi" i have to
> check
> > the matches against all strings in RDD B which means I need generate
> every
> > possible combination r
>


Facing issues while reading parquet file in spark 1.2.1

2016-05-25 Thread vaibhav srivastava
Hi All,

 I am facing the below stack trace while reading data from a Parquet file:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:247)

at
parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:47)

at
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)

at
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)

at
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)

at
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)

at
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)

at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)

at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)

at scala.Option.map(Option.scala:145)

at
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)

at
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)

at
org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65)

at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165)

Please suggest. It seems like it is not able to convert some data.


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke

Alternatively depending on the exact use case you may employ solr on Hadoop for 
text analytics 

> On 25 May 2016, at 12:57, Priya Ch  wrote:
> 
> Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B of
> strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
> to check the matches found in rdd B as such for string "hi" i have to check
> the matches against all strings in RDD B which means I need generate every
> possible combination r

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke

No, this is not needed; look at the map/reduce operations and the standard 
Spark word count.

> On 25 May 2016, at 12:57, Priya Ch  wrote:
> 
> Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD B of
> strings as {"padma","hihi","chch","priya"}. For every string rdd A i need
> to check the matches found in rdd B as such for string "hi" i have to check
> the matches against all strings in RDD B which means I need generate every
> possible combination r

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Let's say I have RDD A of strings such as {"hi","bye","ch"} and another RDD B of
strings such as {"padma","hihi","chch","priya"}. For every string in RDD A I
need to check the matches found in RDD B; for example, for the string "hi" I
have to check the matches against all strings in RDD B, which means I need to
generate every possible combination. Hence I am generating the Cartesian
product and then using a map transformation on the Cartesian RDD to check the
matches found.

Is there any better way I could do this other than performing the Cartesian?
Till now the application took 30 mins, and on top of that I see executor lost
issues.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:22 PM, Jörn Franke  wrote:

> What is the use case of this ? A Cartesian product is by definition slow
> in any system. Why do you need this? How long does your application take
> now?
>
> On 25 May 2016, at 12:42, Priya Ch  wrote:
>
> I tried
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
> this is taking too much time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro 
> wrote:
>
>> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
>> parquet, orc, ...?
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch 
>> wrote:
>>
>>> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs
>>> .I am converting the joined dataframe to rdd (dataframe.rdd) and using
>>> saveAsTextFile, trying to save it. However, this is also taking too much
>>> time.
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro >> > wrote:
>>>
 Hi,

 Seems you'd be better off using DataFrame#join instead of  RDD
 .cartesian
 because it always needs shuffle operations which have alot of
 overheads such as reflection, serialization, ...
 In your case,  since the smaller table is 7mb, DataFrame#join uses a
 broadcast strategy.
 This is a little more efficient than  RDD.cartesian.

 // maropu

 On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> It is basically a Cartesian join like RDBMS
>
> Example:
>
> SELECT * FROM FinancialCodes,  FinancialData
>
> The results of this query matches every row in the FinancialCodes
> table with every row in the FinancialData table.  Each row consists
> of all columns from the FinancialCodes table followed by all columns from
> the FinancialData table.
>
>
> Not very useful
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 25 May 2016 at 08:05, Priya Ch 
> wrote:
>
>> Hi All,
>>
>>   I have two RDDs A and B where in A is of size 30 MB and B is of
>> size 7 MB, A.cartesian(B) is taking too much time. Is there any 
>> bottleneck
>> in cartesian operation ?
>>
>> I am using spark 1.6.0 version
>>
>> Regards,
>> Padma Ch
>>
>
>


 --
 ---
 Takeshi Yamamuro

>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
What is the use case of this ? A Cartesian product is by definition slow in any 
system. Why do you need this? How long does your application take now?

> On 25 May 2016, at 12:42, Priya Ch  wrote:
> 
> I tried 
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even 
> this is taking too much time.
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro  
>> wrote:
>> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as 
>> parquet, orc, ...?
>> 
>> // maropu
>> 
>>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch  
>>> wrote:
>>> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs .I 
>>> am converting the joined dataframe to rdd (dataframe.rdd) and using 
>>> saveAsTextFile, trying to save it. However, this is also taking too much 
>>> time.
>>> 
>>> Thanks,
>>> Padma Ch
>>> 
 On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro  
 wrote:
 Hi, 
 
 Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
 because it always needs shuffle operations which have alot of overheads 
 such as reflection, serialization, ...
 In your case,  since the smaller table is 7mb, DataFrame#join uses a 
 broadcast strategy.
 This is a little more efficient than  RDD.cartesian.
 
 // maropu
 
> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh 
>  wrote:
> It is basically a Cartesian join like RDBMS 
> 
> Example:
> 
> SELECT * FROM FinancialCodes,  FinancialData
> 
> The results of this query matches every row in the FinancialCodes table 
> with every row in the FinancialData table.  Each row consists of all 
> columns from the FinancialCodes table followed by all columns from the 
> FinancialData table.
> 
> Not very useful 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 25 May 2016 at 08:05, Priya Ch  wrote:
>> Hi All,
>> 
>>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7 
>> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in 
>> cartesian operation ?
>> 
>> I am using spark 1.6.0 version
>> 
>> Regards,
>> Padma Ch
 
 
 
 -- 
 ---
 Takeshi Yamamuro
>> 
>> 
>> 
>> -- 
>> ---
>> Takeshi Yamamuro
> 


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
I tried
dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
this is taking too much time.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro 
wrote:

> Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
> parquet, orc, ...?
>
> // maropu
>
> On Wed, May 25, 2016 at 7:10 PM, Priya Ch 
> wrote:
>
>> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs
>> .I am converting the joined dataframe to rdd (dataframe.rdd) and using
>> saveAsTextFile, trying to save it. However, this is also taking too much
>> time.
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
>>> because it always needs shuffle operations which have alot of overheads
>>> such as reflection, serialization, ...
>>> In your case,  since the smaller table is 7mb, DataFrame#join uses a
>>> broadcast strategy.
>>> This is a little more efficient than  RDD.cartesian.
>>>
>>> // maropu
>>>
>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 It is basically a Cartesian join like RDBMS

 Example:

 SELECT * FROM FinancialCodes,  FinancialData

 The results of this query matches every row in the FinancialCodes table
 with every row in the FinancialData table.  Each row consists of all
 columns from the FinancialCodes table followed by all columns from the
 FinancialData table.


 Not very useful


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 25 May 2016 at 08:05, Priya Ch  wrote:

> Hi All,
>
>   I have two RDDs A and B where in A is of size 30 MB and B is of size
> 7 MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
> cartesian operation ?
>
> I am using spark 1.6.0 version
>
> Regards,
> Padma Ch
>


>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Using Java in Spark shell

2016-05-25 Thread Keith


There is no Java shell in Spark.

> On May 25, 2016, at 1:11 AM, Ashok Kumar  wrote:
> 
> Hello,
> 
> A newbie question.
> 
> Is it possible to use java code directly in spark shell without using maven 
> to build a jar file?
> 
> How can I switch from scala to java in spark shell?
> 
> Thanks
> 
> 


Re: Using Java in Spark shell

2016-05-25 Thread Ted Yu
I found this:

:javap disassemble a file or class name

But no direct interpretation of Java code.
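
There is no way to type Java source into the shell, but compiled Java classes can still be
used from the Scala shell. A hedged sketch of that workaround follows; the class, package
and file paths are hypothetical stand-ins for your own code.

// start the shell with a pre-built jar on the classpath:
//   spark-shell --jars /path/to/my-java-utils.jar

// then call the compiled Java code from Scala; com.example.MyJavaUtils and its
// normalize method are hypothetical placeholders for your own Java class
import com.example.MyJavaUtils

val cleaned = sc.textFile("/tmp/input.txt").map(line => MyJavaUtils.normalize(line))
cleaned.take(5).foreach(println)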

On Tue, May 24, 2016 at 10:11 PM, Ashok Kumar 
wrote:

> Hello,
>
> A newbie question.
>
> Is it possible to use java code directly in spark shell without using
> maven to build a jar file?
>
> How can I switch from scala to java in spark shell?
>
> Thanks
>
>
>


Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread spark.raj
Thank you for your help Mich. 

Thanks
Rajesh
 

Sent from Yahoo Mail. Get the app 

On Wednesday, May 25, 2016 3:14 PM, Mich Talebzadeh 
 wrote:
 

 You may have some memory issues OOM etc that terminated the process. 
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 25 May 2016 at 10:35,  wrote:

Hi Friends,
In the yarn log files of the nodemanager i can see the error below. Can i know 
why i am getting this error.

ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: 
SIGTERM
Thanks
Rajesh
 

Sent from Yahoo Mail. Get the app 

On Wednesday, May 25, 2016 1:08 PM, Mich Talebzadeh 
 wrote:
 

 Yes check the yarn log files both resourcemanager and nodemanager. Also ensure 
that you have set up work directories consistently, especially 
yarn.nodemanager.local-dirs 
HTH
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 25 May 2016 at 08:29, Jeff Zhang  wrote:

Could you check the yarn app logs ?

On Wed, May 25, 2016 at 3:23 PM,  wrote:

Hi,
I am running spark streaming job on yarn-client mode. If run muliple jobs, some 
of the jobs failing and giving below error message. Is there any configuration 
missing?
ERROR apache.spark.util.Utils - Uncaught exception in thread main
java.lang.NullPointerException
    at 
org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
    at 
org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
    at org.apache.spark.SparkContext.(SparkContext.scala:593)
    at 
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
    at 
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
    at 
org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Yarn application 
has already ended! It might have been killed or unable to launch application 
master.
    at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
    at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.(SparkContext.scala:523)
    at 
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
    at 
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
    at 
org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at 

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as
Parquet, ORC, ...?

// maropu
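
For reference, a minimal sketch of that suggestion against the Spark 1.6 DataFrame
writer API; joinedDF and the output paths are placeholders.

// write the joined DataFrame directly, keeping the columnar format, instead of
// converting to an RDD of strings and calling saveAsTextFile
joinedDF.write.mode("overwrite").parquet("hdfs:///user/padma/joined_output")

// or, if CSV is required, through the spark-csv package already mentioned in this thread
joinedDF.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("hdfs:///user/padma/joined_output_csv")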

On Wed, May 25, 2016 at 7:10 PM, Priya Ch 
wrote:

> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs .I
> am converting the joined dataframe to rdd (dataframe.rdd) and using
> saveAsTextFile, trying to save it. However, this is also taking too much
> time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
>> because it always needs shuffle operations which have alot of overheads
>> such as reflection, serialization, ...
>> In your case,  since the smaller table is 7mb, DataFrame#join uses a
>> broadcast strategy.
>> This is a little more efficient than  RDD.cartesian.
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It is basically a Cartesian join like RDBMS
>>>
>>> Example:
>>>
>>> SELECT * FROM FinancialCodes,  FinancialData
>>>
>>> The results of this query matches every row in the FinancialCodes table
>>> with every row in the FinancialData table.  Each row consists of all
>>> columns from the FinancialCodes table followed by all columns from the
>>> FinancialData table.
>>>
>>>
>>> Not very useful
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 25 May 2016 at 08:05, Priya Ch  wrote:
>>>
 Hi All,

   I have two RDDs A and B where in A is of size 30 MB and B is of size
 7 MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
 cartesian operation ?

 I am using spark 1.6.0 version

 Regards,
 Padma Ch

>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi, yes, I have joined using DataFrame join. Now, to save this into HDFS,
I am converting the joined dataframe to an RDD (dataframe.rdd) and using
saveAsTextFile to save it. However, this is also taking too much time.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Seems you'd be better off using DataFrame#join instead of  RDD.cartesian
> because it always needs shuffle operations which have alot of overheads
> such as reflection, serialization, ...
> In your case,  since the smaller table is 7mb, DataFrame#join uses a
> broadcast strategy.
> This is a little more efficient than  RDD.cartesian.
>
> // maropu
>
> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> It is basically a Cartesian join like RDBMS
>>
>> Example:
>>
>> SELECT * FROM FinancialCodes,  FinancialData
>>
>> The results of this query matches every row in the FinancialCodes table
>> with every row in the FinancialData table.  Each row consists of all
>> columns from the FinancialCodes table followed by all columns from the
>> FinancialData table.
>>
>>
>> Not very useful
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 25 May 2016 at 08:05, Priya Ch  wrote:
>>
>>> Hi All,
>>>
>>>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7
>>> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
>>> cartesian operation ?
>>>
>>> I am using spark 1.6.0 version
>>>
>>> Regards,
>>> Padma Ch
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Mich Talebzadeh
You may have some memory issues (OOM etc.) that terminated the process.
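
If memory pressure is the cause, one thing to check is how much each of the concurrently
running jobs asks YARN for. A minimal sketch of explicit sizing follows; every value is a
placeholder to be tuned against your cluster, not a recommendation.

import org.apache.spark.SparkConf

// give each streaming job an explicit, modest footprint so several yarn-client
// jobs can run side by side without exhausting the cluster
val conf = new SparkConf()
  .setAppName("tweet-streaming-hdfs-load")            // placeholder name
  .set("spark.executor.memory", "2g")                 // per-executor heap
  .set("spark.yarn.executor.memoryOverhead", "512")   // MB of off-heap headroom for YARN
  .set("spark.executor.instances", "2")               // cap executors per job

The same settings can be passed to spark-submit as --executor-memory, --num-executors
and --conf spark.yarn.executor.memoryOverhead=512.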

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 25 May 2016 at 10:35,  wrote:

> Hi Friends,
>
> In the yarn log files of the nodemanager i can see the error below. Can i
> know why i am getting this error.
>
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15:
> SIGTERM
>
> Thanks
> Rajesh
>
>
>
> Sent from Yahoo Mail. Get the app 
>
>
> On Wednesday, May 25, 2016 1:08 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Yes check the yarn log files both resourcemanager and nodemanager. Also
> ensure that you have set up work directories consistently, especially
> yarn.nodemanager.local-dirs
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 25 May 2016 at 08:29, Jeff Zhang  wrote:
>
> Could you check the yarn app logs ?
>
>
> On Wed, May 25, 2016 at 3:23 PM,  wrote:
>
> Hi,
>
> I am running spark streaming job on yarn-client mode. If run muliple jobs,
> some of the jobs failing and giving below error message. Is there any
> configuration missing?
>
> ERROR apache.spark.util.Utils - Uncaught exception in thread main
> java.lang.NullPointerException
> at
> org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
> at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
> at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
> at
> org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
> at org.apache.spark.SparkContext.(SparkContext.scala:593)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
> at
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
> at
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Yarn
> application has already ended! It might have been killed or unable to
> launch application master.
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.(SparkContext.scala:523)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
> at
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
> at
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> 

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread spark.raj
Hi Friends,
In the YARN log files of the nodemanager I can see the error below. Can I know
why I am getting this error?

ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: 
SIGTERM
Thanks
Rajesh
 

Sent from Yahoo Mail. Get the app 

On Wednesday, May 25, 2016 1:08 PM, Mich Talebzadeh 
 wrote:
 

 Yes check the yarn log files both resourcemanager and nodemanager. Also ensure 
that you have set up work directories consistently, especially 
yarn.nodemanager.local-dirs 
HTH
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 25 May 2016 at 08:29, Jeff Zhang  wrote:

Could you check the yarn app logs ?

On Wed, May 25, 2016 at 3:23 PM,  wrote:

Hi,
I am running spark streaming job on yarn-client mode. If run muliple jobs, some 
of the jobs failing and giving below error message. Is there any configuration 
missing?
ERROR apache.spark.util.Utils - Uncaught exception in thread main
java.lang.NullPointerException
    at 
org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
    at 
org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
    at org.apache.spark.SparkContext.(SparkContext.scala:593)
    at 
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
    at 
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
    at 
org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Yarn application 
has already ended! It might have been killed or unable to launch application 
master.
    at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
    at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.(SparkContext.scala:523)
    at 
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
    at 
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
    at 
org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO  apache.spark.storage.DiskBlockManager - Shutdown hook called
INFO  apache.spark.util.ShutdownHookManager - Shutdown hook called
INFO  apache.spark.util.ShutdownHookManager - Deleting directory 
/tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a
INFO  apache.spark.util.ShutdownHookManager - 

Re: Is it a bug?

2016-05-25 Thread Zheng Wendell
Any update?

On Sun, May 8, 2016 at 3:17 PM, Ted Yu  wrote:

> I don't think so.
> RDD is immutable.
>
> > On May 8, 2016, at 2:14 AM, Sisyphuss  wrote:
> >
> > 
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-a-bug-tp26898.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: How does Spark set task indexes?

2016-05-25 Thread Adrien Mogenet
Yes, I've noticed this one and its related cousin, but I'm not sure it is the
same issue here; our job "properly" ends after 6 attempts.
We'll try with speculative mode disabled anyway!
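
For reference, a minimal sketch of disabling speculation for a single job (the application
name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// spark.speculation defaults to false; if it was enabled cluster-wide, this overrides it
// for one job, removing one common source of CommitDeniedException (a speculative
// duplicate racing the original attempt for the output commit)
val conf = new SparkConf()
  .setAppName("sessionize-json-to-parquet")   // placeholder name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)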

On 25 May 2016 at 00:13, Ted Yu  wrote:

> Have you taken a look at SPARK-14915 ?
>
> On Tue, May 24, 2016 at 1:00 PM, Adrien Mogenet <
> adrien.moge...@contentsquare.com> wrote:
>
>> Hi,
>>
>> I'm wondering how Spark is setting the "index" of task?
>> I'm asking this question because we have a job that constantly fails at
>> task index = 421.
>>
>> When increasing number of partitions, this then fails at index=4421.
>> Increase it a little bit more, now it's 24421.
>>
>> Our job is as simple as "(1) read json -> (2) group-by sesion identifier
>> -> (3) write parquet files" and always fails somewhere at step (3) with a
>> CommitDeniedException. We've identified that some troubles are basically
>> due to uneven data repartition right after step (2), and now try to go
>> further in our understanding on how does Spark behaves.
>>
>> We're using Spark 1.5.2, scala 2.11, on top of hadoop 2.6.0
>>
>> --
>>
>> *Adrien Mogenet*
>> Head of Backend/Infrastructure
>> adrien.moge...@contentsquare.com
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
>>
>
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris


never understand

2016-05-25 Thread pseudo oduesp
Hi guys,

- I get this error with PySpark 1.5.0 under Cloudera CDH 5.5 (YARN).
- I use YARN to deploy the job on the cluster.
- I use HiveContext and Parquet files to save my data.
- The container limit is 16 GB; the executor memory I tested before was 12 GB.
- I tried to increase the number of partitions (200 by default), multiplying it
  by 2 and 3, without success.
- I tried to change the number of SQL shuffle partitions.
- I notice in the Spark UI that when shuffle write is triggered there is no
  problem, but when shuffle read is triggered I lose executors and get errors.

I am really blocked by this error and don't know where it comes from:
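
The usual knobs for this failure mode (executor heap exhausted while Parquet buffers
column dictionaries during the write) are sketched below. It is shown in Scala for
consistency with the rest of this archive; the same configuration keys apply from PySpark,
and every value is a placeholder to tune, not a recommendation.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "10g")                 // heap; must fit inside the 16 GB container
  .set("spark.yarn.executor.memoryOverhead", "2048")   // MB reserved outside the heap for YARN
  .set("spark.sql.shuffle.partitions", "800")          // more, smaller partitions on the shuffle-read side

// the shuffle-partition count can also be changed at runtime on the HiveContext:
//   sqlContext.setConf("spark.sql.shuffle.partitions", "800")
// and, since the OOM is in Parquet's dictionary writer, dictionary encoding can be
// turned off as another lever (an assumption worth testing, not a guaranteed fix):
//   sc.hadoopConfiguration.set("parquet.enable.dictionary", "false")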




 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread
Thread[Executor task launch worker-5,5,main]
java.lang.OutOfMemoryError: Java heap space
at
parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
at parquet.column.values.dictionary.IntList.(IntList.java:86)
at
parquet.column.values.dictionary.DictionaryValuesWriter.(DictionaryValuesWriter.java:93)
at
parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.(DictionaryValuesWriter.java:229)
at
parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:131)
at
parquet.column.ParquetProperties.dictWriterWithFallBack(ParquetProperties.java:178)
at
parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:203)
at parquet.column.impl.ColumnWriterV1.(ColumnWriterV1.java:84)
at
parquet.column.impl.ColumnWriteStoreV1.newMemColumn(ColumnWriteStoreV1.java:68)
at
parquet.column.impl.ColumnWriteStoreV1.getColumnWriter(ColumnWriteStoreV1.java:56)
at
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:207)
at
parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:405)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
at
parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:97)
at
parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:100)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:326)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
 at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:405)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
at
parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:97)
at
parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:100)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:326)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/25 09:54:42 ERROR util.SparkUncaughtExceptionHandler: Uncaught
exception in thread Thread[Executor task launch worker-6,5,main]
java.lang.OutOfMemoryError: Java heap space
at
parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
at parquet.column.values.dictionary.IntList.(IntList.java:86)
at
parquet.column.values.dictionary.DictionaryValuesWriter.(DictionaryValuesWriter.java:93)
at
parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.(DictionaryValuesWriter.java:229)
at
parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:131)
at

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Hi,

It seems you'd be better off using DataFrame#join instead of RDD.cartesian,
because RDD.cartesian always needs shuffle operations, which have a lot of
overheads such as reflection, serialization, ...
In your case, since the smaller table is 7 MB, DataFrame#join uses a
broadcast strategy.
This is a little more efficient than RDD.cartesian.

// maropu
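
A minimal sketch of that suggestion, assuming the two datasets are already DataFrames
dfA and dfB joined on a column named id (all of these names are placeholders):

import org.apache.spark.sql.functions.broadcast

// hint Spark to ship the small (7 MB) table to every executor instead of shuffling both sides
val joined = dfA.join(broadcast(dfB), dfA("id") === dfB("id"))
joined.explain()   // the physical plan should show a broadcast join, not a shuffle

With the default spark.sql.autoBroadcastJoinThreshold of 10 MB, a 7 MB table should be
broadcast automatically even without the explicit hint.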

On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh 
wrote:

> It is basically a Cartesian join like RDBMS
>
> Example:
>
> SELECT * FROM FinancialCodes,  FinancialData
>
> The results of this query matches every row in the FinancialCodes table
> with every row in the FinancialData table.  Each row consists of all
> columns from the FinancialCodes table followed by all columns from the
> FinancialData table.
>
>
> Not very useful
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 25 May 2016 at 08:05, Priya Ch  wrote:
>
>> Hi All,
>>
>>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7
>> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
>> cartesian operation ?
>>
>> I am using spark 1.6.0 version
>>
>> Regards,
>> Padma Ch
>>
>
>


-- 
---
Takeshi Yamamuro


Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Mich Talebzadeh
Yes, check the YARN log files for both the resourcemanager and the nodemanager. Also
ensure that you have set up the work directories consistently, especially
yarn.nodemanager.local-dirs.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 25 May 2016 at 08:29, Jeff Zhang  wrote:

> Could you check the yarn app logs ?
>
>
> On Wed, May 25, 2016 at 3:23 PM,  wrote:
>
>> Hi,
>>
>> I am running spark streaming job on yarn-client mode. If run muliple
>> jobs, some of the jobs failing and giving below error message. Is there any
>> configuration missing?
>>
>> ERROR apache.spark.util.Utils - Uncaught exception in thread main
>> java.lang.NullPointerException
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
>> at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
>> at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
>> at
>> org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
>> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
>> at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
>> at org.apache.spark.SparkContext.(SparkContext.scala:593)
>> at
>> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
>> at
>> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
>> at
>> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
>> at
>> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
>> at
>> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:483)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>> at
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext
>> Exception in thread "main" org.apache.spark.SparkException: Yarn
>> application has already ended! It might have been killed or unable to
>> launch application master.
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>> at org.apache.spark.SparkContext.(SparkContext.scala:523)
>> at
>> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
>> at
>> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
>> at
>> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
>> at
>> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
>> at
>> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:483)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>> at
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> INFO  apache.spark.storage.DiskBlockManager - Shutdown hook called
>> INFO  apache.spark.util.ShutdownHookManager - Shutdown hook called
>> INFO  apache.spark.util.ShutdownHookManager - Deleting directory
>> /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a
>> INFO  apache.spark.util.ShutdownHookManager - Deleting directory
>> 

Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Scott W
I'm running into the error below while trying to consume messages from Kafka
through Spark Streaming (Kafka direct API). This used to work OK when using
the Spark standalone cluster manager. We're just switching to Cloudera 5.7,
using YARN to manage the Spark cluster, and started to see the error below.

Few details:
- Spark 1.6.0
- Using Kafka direct stream API
- Kafka broker version (0.8.2.1)
- Kafka version in the classpath of Yarn executors (0.9)
- Kafka brokers not managed by Cloudera

The only difference I see between using standalone cluster manager and yarn
is the Kafka version being used on the consumer end. (0.8.2.1 vs 0.9)

Trying to figure out whether the version mismatch is really the issue. If that
is indeed the case, what would be the fix other than upgrading the Kafka
brokers to 0.9 as well (eventually yes, but not for now)? Or is there something
else I'm missing here?

Appreciate the help.
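
A mismatch between the 0.8.2.1 client that spark-streaming-kafka 1.6.0 was built against
and the 0.9 jars on the executor classpath is a plausible cause of a
BufferUnderflowException while parsing the fetch response. One hedged workaround, short
of upgrading the brokers, is to bundle the 0.8.2.1 client with the application and let it
win on the classpath; a minimal sketch:

import org.apache.spark.SparkConf

// prefer the application's own Kafka 0.8.2.1 client jars over the 0.9 jars that the
// cluster puts on the executor classpath (both flags are experimental in Spark 1.6)
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")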

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 200.0 failed 4 times, most recent failure: Lost task 0.3 in stage
200.0 (TID 203,..): java.nio.BufferUnderflowException
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)
at kafka.api.ApiUtils$.readShortString(ApiUtils.scala:40)
at kafka.api.TopicData$.readFrom(FetchResponse.scala:96)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
at
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:192)
at
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:942)
at
org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:942)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)


Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Jeff Zhang
Could you check the yarn app logs ?


On Wed, May 25, 2016 at 3:23 PM,  wrote:

> Hi,
>
> I am running spark streaming job on yarn-client mode. If run muliple jobs,
> some of the jobs failing and giving below error message. Is there any
> configuration missing?
>
> ERROR apache.spark.util.Utils - Uncaught exception in thread main
> java.lang.NullPointerException
> at
> org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
> at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
> at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
> at
> org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
> at org.apache.spark.SparkContext.(SparkContext.scala:593)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
> at
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
> at
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Yarn
> application has already ended! It might have been killed or unable to
> launch application master.
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.(SparkContext.scala:523)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
> at
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
> at
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
> at
> com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO  apache.spark.storage.DiskBlockManager - Shutdown hook called
> INFO  apache.spark.util.ShutdownHookManager - Shutdown hook called
> INFO  apache.spark.util.ShutdownHookManager - Deleting directory
> /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a
> INFO  apache.spark.util.ShutdownHookManager - Deleting directory
> /tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061
>
>
>
> Sent from Yahoo Mail. Get the app 
>



-- 
Best Regards

Jeff Zhang


run multiple spark jobs yarn-client mode

2016-05-25 Thread spark.raj
Hi,
I am running a Spark Streaming job in yarn-client mode. If I run multiple jobs, some
of the jobs fail and give the error message below. Is there any configuration
missing?
ERROR apache.spark.util.Utils - Uncaught exception in thread main
java.lang.NullPointerException
    at 
org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
    at 
org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
    at org.apache.spark.SparkContext.(SparkContext.scala:593)
    at 
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
    at 
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
    at 
org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Yarn application 
has already ended! It might have been killed or unable to launch application 
master.
    at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
    at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.(SparkContext.scala:523)
    at 
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:878)
    at 
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
    at 
org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:134)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.init(SparkTweetStreamingHDFSLoad.java:212)
    at 
com.infinite.spark.SparkTweetStreamingHDFSLoad.main(SparkTweetStreamingHDFSLoad.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO  apache.spark.storage.DiskBlockManager - Shutdown hook called
INFO  apache.spark.util.ShutdownHookManager - Shutdown hook called
INFO  apache.spark.util.ShutdownHookManager - Deleting directory 
/tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061/userFiles-857fece4-83c4-441a-8d3e-2a6ae8e3193a
INFO  apache.spark.util.ShutdownHookManager - Deleting directory 
/tmp/spark-945fa8f4-477c-4a65-a572-b247e9249061
 

Sent from Yahoo Mail. Get the app

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Mich Talebzadeh
It is basically a Cartesian join, like in an RDBMS.

Example:

SELECT * FROM FinancialCodes,  FinancialData

The results of this query match every row in the FinancialCodes table
with every row in the FinancialData table.  Each row consists
of all columns from the FinancialCodes table followed by all columns from
the FinancialData table.


Not very useful
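
The RDD counterpart of that SQL is RDD.cartesian, which is what the original question
uses; a minimal sketch with placeholder data:

// the result pairs every element of A with every element of B,
// so its size is codes.count * data.count
val codes = sc.parallelize(Seq("C1", "C2"))
val data  = sc.parallelize(Seq(10, 20, 30))
val product = codes.cartesian(data)     // RDD[(String, Int)] with 2 * 3 = 6 rows
product.collect().foreach(println)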


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 25 May 2016 at 08:05, Priya Ch  wrote:

> Hi All,
>
>   I have two RDDs A and B where in A is of size 30 MB and B is of size 7
> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
> cartesian operation ?
>
> I am using spark 1.6.0 version
>
> Regards,
> Padma Ch
>


python application cluster mode in standalone spark cluster

2016-05-25 Thread Jan Sourek
As the official documentation states, 'Currently only YARN supports cluster
mode for Python applications.'
I would like to know if work is being done or planned to support cluster
mode for Python applications on standalone Spark clusters.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/python-application-cluster-mode-in-standalone-spark-cluster-tp27020.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi All,

  I have two RDDs A and B, where A is of size 30 MB and B is of size 7 MB.
A.cartesian(B) is taking too much time. Is there any bottleneck in the
cartesian operation?

I am using spark 1.6.0 version

Regards,
Padma Ch


Re: Not able to write output to local filsystem from Standalone mode.

2016-05-25 Thread Jacek Laskowski
Hi Mathieu,

Thanks a lot for the answer! I did *not* know it's the driver that
creates the directory.

You said "standalone mode"; is this the case for the other modes,
YARN and Mesos?

p.s. Did you find it in the code or...just experienced it before? #curious

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
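
To make the behavior described in the quoted reply below concrete (the driver creates
/home/stuti/test1/_temporary on its own disk, then each executor looks for that directory
on its own machine and fails), here is a hedged sketch of the two usual workarounds,
assuming data is an RDD[String]:

// 1) write to a filesystem that every node can see (HDFS or a shared NFS mount)
data.saveAsTextFile("hdfs:///user/stuti/test1")

// 2) for small results only: pull everything to the driver and write with plain Java IO
import java.io.PrintWriter
val out = new PrintWriter("/home/stuti/test1.txt")
data.collect().foreach(line => out.println(line))
out.close()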


On Tue, May 24, 2016 at 4:04 PM, Mathieu Longtin  wrote:
> In standalone mode, executor assume they have access to a shared file
> system. The driver creates the directory and the executor write files, so
> the executors end up not writing anything since there is no local directory.
>
> On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi  wrote:
>>
>> hi Jacek,
>>
>> Parent directory already present, its my home directory. Im using Linux
>> (Redhat) machine 64 bit.
>> Also I noticed that "test1" folder is created in my master with
>> subdirectory as "_temporary" which is empty. but on slaves, no such
>> directory is created under /home/stuti.
>>
>> Thanks
>> Stuti
>> 
>> From: Jacek Laskowski [ja...@japila.pl]
>> Sent: Tuesday, May 24, 2016 5:27 PM
>> To: Stuti Awasthi
>> Cc: user
>> Subject: Re: Not able to write output to local filsystem from Standalone
>> mode.
>>
>> Hi,
>>
>> What happens when you create the parent directory /home/stuti? I think the
>> failure is due to missing parent directories. What's the OS?
>>
>> Jacek
>>
>> On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:
>>
>> Hi All,
>>
>> I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2
>> Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch
>> shell , read the input file from local filesystem and perform transformation
>> successfully. When I try to write my output in local filesystem path then I
>> receive below error .
>>
>>
>>
>> I tried to search on web and found similar Jira :
>> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows
>> resolved for Spark 1.3+ but already people have posted the same issue still
>> persists in latest versions.
>>
>>
>>
>> ERROR
>>
>> scala> data.saveAsTextFile("/home/stuti/test1")
>>
>> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2,
>> server1): java.io.IOException: The temporary job-output directory
>> file:/home/stuti/test1/_temporary doesn't exist!
>>
>> at
>> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
>>
>> at
>> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
>>
>> at
>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
>>
>> at
>> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
>>
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
>>
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
>>
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>> What is the best way to resolve this issue if suppose I don’t want to have
>> Hadoop installed OR is it mandatory to have Hadoop to write the output from
>> Standalone cluster mode.
>>
>>
>>
>> Please suggest.
>>
>>
>>
>> Thanks 
>>
>> Stuti Awasthi
>>
>>
>>
>>
>>
>> ::DISCLAIMER::
>>
>> 
>>
>> The contents of this e-mail and any attachment(s) are confidential and
>> intended for the named recipient(s) only.
>> E-mail transmission is not guaranteed to be secure or error-free as
>> information could be intercepted, corrupted,
>> lost, destroyed, arrive late or incomplete, or may contain viruses in
>> transmission. The e mail and its contents
>> (with or without referred errors) shall therefore not attach any liability
>> on the originator or HCL or its affiliates.
>> Views or opinions, if any, presented in this email are solely those of the
>> author and may not necessarily reflect the
>> views or opinions of HCL or its affiliates. Any form of reproduction,
>> dissemination, copying, disclosure, modification,
>> distribution and / or publication of this message without the prior
>> written consent of authorized representative of
>> HCL is 

Re: why spark 1.6 use Netty instead of Akka?

2016-05-25 Thread Chaoqiang
Todd, Jakob, thank you for your replies.

But there are still some questions I can't understand, such as the following:
1. What is the effect of Akka's dependencies on Spark?
2. What are the detailed conflicts with projects that use both Spark and Akka?
3. Which portion of Akka's features was used in Spark, and what was the extra
maintenance burden of using Akka in Spark?
4. Why is reducing dependencies of the core functionality a good thing?
Please forgive me; I'm new to Spark.

Thanks & Regards.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-spark-1-6-use-Netty-instead-of-Akka-tp27005p27019.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
