question about Union in pyspark and preserving partitioners

2016-05-16 Thread Cameron Davidson-Pilon
I'm looking into how to do more efficient jobs by using partition
strategies, but I'm hitting a blocker after I do a `union` between two
RDDs. Suppose A and B are both RDDs with the same partitioner, that is,

`A.partitioner == B.partitioner`

If I do A.union(B), the resulting RDD has None partitioner. If I am reading
the code correctly, this comes from this line returning False and a
partitioner never being set:
https://github.com/apache/spark/blob/95f4fbae52d26ede94c3ba8248394749f3d95dcc/python/pyspark/rdd.py#L532

What confuses me is that that line will _always_ return False: is it not
true that the union of two RDDs results in the sum of their partition counts,
which will never be equal to self.getNumPartitions()?

Am I missing something about this logical check?
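
For reference, here is a minimal PySpark sketch that reproduces the observation
and shows one workaround (re-partitioning the result, at the cost of a shuffle).
The local context and the partition count of 4 are illustrative assumptions:

from pyspark import SparkContext

sc = SparkContext("local[*]", "union-partitioner")  # assumption: local context

a = sc.parallelize([(i, i) for i in range(100)]).partitionBy(4)
b = sc.parallelize([(i, -i) for i in range(100)]).partitionBy(4)

print(a.partitioner == b.partitioner)   # True: same partition function and count
u = a.union(b)
print(u.partitioner)                    # None, as described above

# Workaround: re-partition the union explicitly so downstream joins and
# aggregations can rely on a known partitioner again (this triggers a shuffle).
u2 = u.partitionBy(4)
print(u2.partitioner == a.partitioner)  # True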

-- 


Cam Davidson-Pilon
cam...@shopify.com


Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Mail.com
Hi Muthu,

Are you on Spark 1.4.1 and Kafka 0.8.2? I have a similar issue even for simple
string messages.

The console producer and consumer work fine, but Spark always returns an empty
RDD. I am using the receiver-based approach.

Thanks,
Pradeep

> On May 16, 2016, at 8:19 PM, Ramaswamy, Muthuraman 
>  wrote:
> 
> Yes, I can see the messages. Also, I wrote a quick custom decoder for avro 
> and it works fine for the following:
> 
>>> kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": 
>>> brokers}, valueDecoder=decoder)
> 
> But, when I use the Confluent Serializers to leverage the Schema Registry 
> (based on the link shown below), it doesn’t work for me. I am not sure 
> whether I need to configure any more details to consume the Schema Registry. 
> I can fetch the schema from the schema registry based on is Ids. The decoder 
> method is not returning any values for me.
> 
> ~Muthu
> 
> 
> 
>> On 5/16/16, 10:49 AM, "Cody Koeninger"  wrote:
>> 
>> Have you checked to make sure you can receive messages just using a
>> byte array for value?
>> 
>> On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
>>  wrote:
>>> I am trying to consume AVRO formatted message through
>>> KafkaUtils.createDirectStream. I followed the listed below example (refer
>>> link) but the messages are not being fetched by the Stream.
>>> 
>>> http://stackoverflow.com/questions/30339636/spark-python-avro-kafka-deserialiser
>>> 
>>> Is there any code missing that I must add to make the above sample work.
>>> Say, I am not sure how the confluent serializers would know the avro schema
>>> info as it knows only the Schema Registry URL info.
>>> 
>>> Appreciate your help.
>>> 
>>> ~Muthu

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Will spark swap memory out to disk if the memory is not enough?

2016-05-16 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtRbEiIXuOOS=Re+PySpark+issue+with+sortByKey+IndexError+list+index+out+of+range+

which led to SPARK-4384

On Mon, May 16, 2016 at 8:09 PM, kramer2...@126.com 
wrote:

> I know the cache operation can cache data in memoyr/disk...
>
> But I am expecting to know will other operation will do the same?
>
> For example, I created a dataframe called df. The df is big so when I run
> some action like :
>
> df.sort(column_name).show()
> df.collect()
>
> It will throw error like :
> 16/05/17 10:53:36 ERROR Executor: Managed memory leak detected;
> size =
> 2359296 bytes, TID = 15
> 16/05/17 10:53:36 ERROR Executor: Exception in task 0.0 in stage
> 12.0 (TID
> 15)
> org.apache.spark.api.python.PythonException: Traceback (most
> recent call
> last):
>   File
> "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
> line 111, in main
> process()
>   File
> "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
> line 106, in process
> serializer.dump_stream(func(split_index, iterator),
> outfile)
>   File
>
> "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
> line 263, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "", line 1, in 
> IndexError: list index out of range
>
>
> I want to know is there any way or configuration to let spark swap memory
> into disk for this situation?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Will-spark-swap-memory-out-to-disk-if-the-memory-is-not-enough-tp26968.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Will spark swap memory out to disk if the memory is not enough?

2016-05-16 Thread kramer2...@126.com
I know the cache operation can cache data in memory/disk...

But I would like to know whether other operations do the same.

For example, I created a DataFrame called df. The df is big, so when I run
some action like:

df.sort(column_name).show()
df.collect()

it throws an error like:
16/05/17 10:53:36 ERROR Executor: Managed memory leak detected; size =
2359296 bytes, TID = 15
16/05/17 10:53:36 ERROR Executor: Exception in task 0.0 in stage 12.0 
(TID
15)
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File
"/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
line 111, in main
process()
  File
"/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File
"/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
  File "", line 1, in 
IndexError: list index out of range


I want to know whether there is any way or configuration to let Spark swap
memory to disk in this situation.
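
For what it's worth, the storage level used when caching is one knob that does
control spilling. A minimal PySpark sketch follows (the local context and toy
DataFrame are illustrative assumptions); note this governs cached blocks only,
it does not reduce what collect() pulls back to the driver:

from pyspark import SparkContext, StorageLevel
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "spill-example")   # assumption: local context
sqlContext = SQLContext(sc)
df = sqlContext.range(0, 1000000)                # illustrative data

# MEMORY_AND_DISK lets cached partitions that do not fit in memory spill
# to local disk instead of being dropped and recomputed later.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.sort("id").show()
df.unpersist()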



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-spark-swap-memory-out-to-disk-if-the-memory-is-not-enough-tp26968.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



what is wrong when adding one column in the dataframe

2016-05-16 Thread Zhiliang Zhu
Hi All,

For a given DataFrame created by Hive SQL, I need to add one more column based
on an existing column, while keeping all of the previous columns in the result
DataFrame.

final double DAYS_30 = 1000 * 60 * 60 * 24 * 30.0;
// DAYS_30 seems difficult to reference inside the SQL?
DataFrame behavior_df = jhql.sql("SELECT cast(user_id as double) as user_id, "
    + "cast(server_timestamp as double) as server_timestamp, "
    + "url, referer, source, app_version, params FROM log.request");

// this runs, but behavior_df.printSchema() is not changed at all
behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

// this runs, but behavior_df.printSchema() only has the one column daysLater30;
// the schema I want has all of the previous columns plus the added daysLater30
behavior_df = behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

Then, how should I do it?
Thank you, 
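
In PySpark terms, a minimal sketch of the intended behaviour with illustrative
data (the Java DataFrame API behaves the same way): withColumn returns a new
DataFrame that keeps every existing column and appends the new one, so the
result has to be assigned back.

from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as F

sc = SparkContext("local[*]", "withcolumn-example")   # assumption: local context
sqlContext = SQLContext(sc)

DAYS_30 = 1000 * 60 * 60 * 24 * 30.0
df = sqlContext.createDataFrame(
    [(1.0, 1463400000000.0)], ["user_id", "server_timestamp"])  # illustrative row

# withColumn does not modify df in place; reassign the returned DataFrame.
df = df.withColumn("daysLater30", F.col("server_timestamp") + DAYS_30)
df.printSchema()   # user_id, server_timestamp, daysLater30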



Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

2016-05-16 Thread sunday2000
I use $SPARK_HOME/sbin/start-slave.sh client10 and get this error:

 org.apache.spark.deploy.worker.Worker running as process 5490.  Stop it first.


 -- Original Message --
 From: "Ted Yu";;
 Sent: Tuesday, May 17, 2016, 10:22 AM
 To: "sunday2000"<2314476...@qq.com>;
 Cc: "user";
 Subject: Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

 

 I guess 2.0 would be released before Spark Summit.
 
 On Mon, May 16, 2016 at 7:19 PM, sunday2000 <2314476...@qq.com> wrote:
  Hi,
   I found the bug status is : Solved, then when will release the solved 
version?
  

 

 -- Original Message --
 From: "Ted Yu";;
 Sent: Tuesday, May 17, 2016, 9:58 AM
 To: "sunday2000"<2314476...@qq.com>;
 Cc: "user";
 Subject: Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

   

 Please take a look at this JIRA:  
 [SPARK-13604][CORE] Sync worker's state after registering with master



 
 On Mon, May 16, 2016 at 6:54 PM, sunday2000 <2314476...@qq.com> wrote:
  Hi,
  
   A client worker stopped with this error message; do you know why this
happens?
  
 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:21 ERROR Worker: Worker registration failed: Duplicate worker ID

Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

2016-05-16 Thread Ted Yu
I guess 2.0 would be released before Spark Summit.

On Mon, May 16, 2016 at 7:19 PM, sunday2000 <2314476...@qq.com> wrote:

> Hi,
>   I found the bug status is : Solved, then when will release the solved
> version?
>
>
> -- Original Message --
> *From:* "Ted Yu";;
> *Sent:* Tuesday, May 17, 2016, 9:58 AM
> *To:* "sunday2000"<2314476...@qq.com>;
> *Cc:* "user";
> *Subject:* Re: Why spark 1.6.1 master can not monitor and start a auto stop
> worker?
>
> Please take a look at this JIRA:
>
> [SPARK-13604][CORE] Sync worker's state after registering with master
>
> On Mon, May 16, 2016 at 6:54 PM, sunday2000 <2314476...@qq.com> wrote:
>
>> Hi,
>>
>>   A client woker stoppped, and has this error message, do u know why this
>> happen?
>>
>> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
>> with the master, since there is an attempt scheduled already.
>> 16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077
>> requested this worker to reconnect.
>> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
>> with the master, since there is an attempt scheduled already.
>> 16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077
>> requested this worker to reconnect.
>> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
>> with the master, since there is an attempt scheduled already.
>> 16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077
>> requested this worker to reconnect.
>> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
>> with the master, since there is an attempt scheduled already.
>> 16/05/17 03:42:21 ERROR Worker: Worker registration failed: Duplicate
>> worker ID
>>
>
>


Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

2016-05-16 Thread sunday2000
Hi,
   I found the bug status is Solved. When will the fixed version be released?
  

 

 -- Original Message --
 From: "Ted Yu";;
 Sent: Tuesday, May 17, 2016, 9:58 AM
 To: "sunday2000"<2314476...@qq.com>;
 Cc: "user";
 Subject: Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

 

 Please take a look at this JIRA:  
 [SPARK-13604][CORE] Sync worker's state after registering with master



 
 On Mon, May 16, 2016 at 6:54 PM, sunday2000 <2314476...@qq.com> wrote:
  Hi,
  
   A client worker stopped with this error message; do you know why this
happens?
  
 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:21 ERROR Worker: Worker registration failed: Duplicate worker ID

Re: Why spark 1.6.1 master can not monitor and start a auto stop worker?

2016-05-16 Thread Ted Yu
Please take a look at this JIRA:

[SPARK-13604][CORE] Sync worker's state after registering with master

On Mon, May 16, 2016 at 6:54 PM, sunday2000 <2314476...@qq.com> wrote:

> Hi,
>
>   A client woker stoppped, and has this error message, do u know why this
> happen?
>
> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
> with the master, since there is an attempt scheduled already.
> 16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077
> requested this worker to reconnect.
> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
> with the master, since there is an attempt scheduled already.
> 16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077
> requested this worker to reconnect.
> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
> with the master, since there is an attempt scheduled already.
> 16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077
> requested this worker to reconnect.
> 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register
> with the master, since there is an attempt scheduled already.
> 16/05/17 03:42:21 ERROR Worker: Worker registration failed: Duplicate
> worker ID
>


Why spark 1.6.1 master can not monitor and start a auto stop worker?

2016-05-16 Thread sunday2000
Hi,
  
   A client worker stopped with this error message; do you know why this
happens?
  
 16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:20 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/17 03:42:20 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/17 03:42:21 ERROR Worker: Worker registration failed: Duplicate worker ID

Re: How to get the batch information from Streaming UI

2016-05-16 Thread John Trengrove
You would want to add a listener to your Spark Streaming context. Have a
look at the StatsReportListener [1].

[1]
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StatsReportListener
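
If the goal is to log the per-batch numbers rather than read them off the UI, a
rough PySpark sketch of the same listener idea (assuming Spark 1.6+, where
pyspark.streaming.listener is available; the batch-info object handed to the
callback is a proxy of the Scala BatchInfo, so treat its accessors as mirroring
that API):

import logging

from pyspark.streaming.listener import StreamingListener

class BatchInfoLogger(StreamingListener):
    def onBatchCompleted(self, batchCompleted):
        # batchCompleted wraps the Scala StreamingListenerBatchCompleted event;
        # its string form carries the batch time, record count and delays.
        logging.info("Batch completed: %s", batchCompleted.batchInfo())

# ssc.addStreamingListener(BatchInfoLogger())   # ssc: an existing StreamingContext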

2016-05-17 7:18 GMT+10:00 Samuel Zhou :

> Hi,
>
> Does anyone know how to get the batch information(like batch time, input
> size, processing time, status) from Streaming UI by using Scala/Java API ?
> Because I want to put the information in log files and the streaming jobs
> are managed by YARN.
>
> Thanks,
> Samuel
>


Re: Silly Question on my part...

2016-05-16 Thread John Trengrove
If you want to share RDDs, it might be a good idea to check out
Tachyon / Alluxio.

For the Thrift server, I believe the datasets are located in your Spark
cluster as RDDs and you just communicate with it via the Thrift
JDBC Distributed Query Engine connector.

2016-05-17 5:12 GMT+10:00 Michael Segel :

> For one use case.. we were considering using the thrift server as a way to
> allow multiple clients access shared RDDs.
>
> Within the Thrift Context, we create an RDD and expose it as a hive table.
>
> The question  is… where does the RDD exist. On the Thrift service node
> itself, or is that just a reference to the RDD which is contained with
> contexts on the cluster?
>
>
> Thx
>
> -Mike
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Ramaswamy, Muthuraman
Yes, I can see the messages. Also, I wrote a quick custom decoder for avro and 
it works fine for the following:

>> kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": 
>> brokers}, valueDecoder=decoder)

But when I use the Confluent serializers to leverage the Schema Registry
(based on the link shown below), it doesn’t work for me. I am not sure whether
I need to configure anything more to consume from the Schema Registry. I can
fetch the schema from the Schema Registry based on its IDs, but the decoder
method is not returning any values for me.

~Muthu



On 5/16/16, 10:49 AM, "Cody Koeninger"  wrote:

>Have you checked to make sure you can receive messages just using a
>byte array for value?
>
>On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
> wrote:
>> I am trying to consume AVRO formatted message through
>> KafkaUtils.createDirectStream. I followed the listed below example (refer
>> link) but the messages are not being fetched by the Stream.
>>
>> http://stackoverflow.com/questions/30339636/spark-python-avro-kafka-deserialiser
>>
>> Is there any code missing that I must add to make the above sample work.
>> Say, I am not sure how the confluent serializers would know the avro schema
>> info as it knows only the Schema Registry URL info.
>>
>> Appreciate your help.
>>
>> ~Muthu
>>
>>
>>


Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Eric Richardson
Good news - and Java 8 as well. I saw Matei after his talk at Scala Days,
and he said he would look into a 2.11 default, but it seems that is already
the plan. Scala 2.12 is getting closer as well.

On Mon, May 16, 2016 at 2:55 PM, Ted Yu  wrote:

> For 2.0, I believe that is the case.
>
> Jenkins jobs have been running against Scala 2.11:
>
> [INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first) @ 
> java8-tests_2.11 ---
>
>
> FYI
>
>
> On Mon, May 16, 2016 at 2:45 PM, Eric Richardson 
> wrote:
>
>> On Thu, May 12, 2016 at 9:23 PM, Luciano Resende 
>> wrote:
>>
>>> Spark has moved to build using Scala 2.11 by default in master/trunk.
>>>
>>
>> Does this mean that the pre-built binaries for download will also move to
>> 2.11 as well?
>>
>>
>>>
>>>
>>> As for the 2.0.0-SNAPSHOT, it is actually the version of master/trunk
>>> and you might be missing some modules/profiles for your build. What command
>>> did you use to build ?
>>>
>>> On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju <
>>> m.vijayaragh...@gmail.com> wrote:
>>>
 Hello All,

 I built Spark from the source code available at
 https://github.com/apache/spark/. Although I haven't specified the
 "-Dscala-2.11" option (to build with Scala 2.11), from the build messages I
 see that it ended up using Scala 2.11. Now, for my application sbt, what
 should be the spark version? I tried the following

 val spark = "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"
 val sparksql = "org.apache.spark" % "spark-sql_2.11" % "2.0.0-SNAPSHOT"

 and scalaVersion := "2.11.8"

 But this setting of spark version gives sbt error

 unresolved dependency: org.apache.spark#spark-core_2.11;2.0.0-SNAPSHOT

 I guess this is because the repository doesn't contain 2.0.0-SNAPSHOT.
 Does this mean, the only option is to put all the required jars in the lib
 folder (unmanaged dependencies)?

 Regards,
 Raghava.

>>>
>>>
>>>
>>> --
>>> Luciano Resende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
>>
>


Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Ted Yu
For 2.0, I believe that is the case.

Jenkins jobs have been running against Scala 2.11:

[INFO] --- scala-maven-plugin:3.2.2:testCompile
(scala-test-compile-first) @ java8-tests_2.11 ---


FYI


On Mon, May 16, 2016 at 2:45 PM, Eric Richardson 
wrote:

> On Thu, May 12, 2016 at 9:23 PM, Luciano Resende 
> wrote:
>
>> Spark has moved to build using Scala 2.11 by default in master/trunk.
>>
>
> Does this mean that the pre-built binaries for download will also move to
> 2.11 as well?
>
>
>>
>>
>> As for the 2.0.0-SNAPSHOT, it is actually the version of master/trunk and
>> you might be missing some modules/profiles for your build. What command did
>> you use to build ?
>>
>> On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju <
>> m.vijayaragh...@gmail.com> wrote:
>>
>>> Hello All,
>>>
>>> I built Spark from the source code available at
>>> https://github.com/apache/spark/. Although I haven't specified the
>>> "-Dscala-2.11" option (to build with Scala 2.11), from the build messages I
>>> see that it ended up using Scala 2.11. Now, for my application sbt, what
>>> should be the spark version? I tried the following
>>>
>>> val spark = "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"
>>> val sparksql = "org.apache.spark" % "spark-sql_2.11" % "2.0.0-SNAPSHOT"
>>>
>>> and scalaVersion := "2.11.8"
>>>
>>> But this setting of spark version gives sbt error
>>>
>>> unresolved dependency: org.apache.spark#spark-core_2.11;2.0.0-SNAPSHOT
>>>
>>> I guess this is because the repository doesn't contain 2.0.0-SNAPSHOT.
>>> Does this mean, the only option is to put all the required jars in the lib
>>> folder (unmanaged dependencies)?
>>>
>>> Regards,
>>> Raghava.
>>>
>>
>>
>>
>> --
>> Luciano Resende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>


Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Eric Richardson
On Thu, May 12, 2016 at 9:23 PM, Luciano Resende 
wrote:

> Spark has moved to build using Scala 2.11 by default in master/trunk.
>

Does this mean that the pre-built binaries for download will also move to
2.11 as well?


>
>
> As for the 2.0.0-SNAPSHOT, it is actually the version of master/trunk and
> you might be missing some modules/profiles for your build. What command did
> you use to build ?
>
> On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Hello All,
>>
>> I built Spark from the source code available at
>> https://github.com/apache/spark/. Although I haven't specified the
>> "-Dscala-2.11" option (to build with Scala 2.11), from the build messages I
>> see that it ended up using Scala 2.11. Now, for my application sbt, what
>> should be the spark version? I tried the following
>>
>> val spark = "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"
>> val sparksql = "org.apache.spark" % "spark-sql_2.11" % "2.0.0-SNAPSHOT"
>>
>> and scalaVersion := "2.11.8"
>>
>> But this setting of spark version gives sbt error
>>
>> unresolved dependency: org.apache.spark#spark-core_2.11;2.0.0-SNAPSHOT
>>
>> I guess this is because the repository doesn't contain 2.0.0-SNAPSHOT.
>> Does this mean, the only option is to put all the required jars in the lib
>> folder (unmanaged dependencies)?
>>
>> Regards,
>> Raghava.
>>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


How to get the batch information from Streaming UI

2016-05-16 Thread Samuel Zhou
Hi,

Does anyone know how to get the batch information (like batch time, input
size, processing time, status) from the Streaming UI using the Scala/Java API?
I want to put the information in log files, and the streaming jobs
are managed by YARN.

Thanks,
Samuel


Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Could you please consider a short answer regarding the Apache Beam Capability 
Matrix todo’s for future Spark 2.0 release [4]? (some related references below 
[5][6])

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what 

[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 

[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 


> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU 
>  wrote:
> 
> Hi,
> 
> We can see in [2] many interesting (and expected!) improvements (promises) 
> like extended SQL support, unified API (DataFrames, DataSets), improved 
> engine (Tungsten relates to ideas from modern compilers and MPP databases - 
> similar to Flink [3]), structured streaming etc. It seems we somehow assist 
> at a smart unification of Big Data analytics (Spark, Flink - best of two 
> worlds)!
> 
> How does Spark respond to the missing What/Where/When/How questions 
> (capabilities) highlighted in the unified model Beam [1] ?
> 
> Best,
> Ovidiu
> 
> [1] 
> https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
>  
> 
> [2] 
> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
>  
> 
> [3] http://stratosphere.eu/project/publications/ 
> 
> 
> 



Re: Monitoring Spark application progress

2016-05-16 Thread Василец Дмитрий
Hello,
use Google Translate on this post (it is in Russian):
https://mkdev.me/posts/ci-i-monitoring-spark-prilozheniy

On Mon, May 16, 2016 at 6:13 PM, Ashok Kumar 
wrote:

> Hi,
>
> I would like to know the approach and tools please to get the full
> performance for a Spark app running through Spark-shell and Spark-sumbit
>
>
>1. Through Spark GUI at 4040?
>2. Through OS utilities top, SAR
>3. Through Java tools like jbuilder etc
>4. Through integration Spark with monitoring tools.
>
>
> Thanks
>


Re: broadcast variable not picked up

2016-05-16 Thread Davies Liu
broadcast_var is only defined inside foo(); I think you need to declare it `global` there:

def foo():
   global broadcast_var
   broadcast_var = sc.broadcast(var)
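
A fuller sketch of that fix, plus an alternative that avoids the global by
capturing the broadcast in a closure (the local context and the toy data are
illustrative assumptions; the names follow the snippet quoted below):

from pyspark import SparkContext

sc = SparkContext("local[*]", "broadcast-example")   # assumption: local context
var = 41                                             # illustrative data

# Option 1: the `global` fix described above.
def kernel(arg):
    value = broadcast_var.value + 1
    # ... some processing with value ...

def foo():
    global broadcast_var
    broadcast_var = sc.broadcast(var)
    sc.parallelize(range(10)).foreach(kernel)

# Option 2: avoid the global entirely by defining the kernel where the
# broadcast is in scope, so it gets captured in the task closure.
def foo_closure():
    bv = sc.broadcast(var)
    def kernel2(arg):
        value = bv.value + 1
        # ... some processing with value ...
    sc.parallelize(range(10)).foreach(kernel2)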

On Fri, May 13, 2016 at 3:53 PM, abi  wrote:
> def kernel(arg):
> input = broadcast_var.value + 1
> #some processing with input
>
> def foo():
>   
>   
>   broadcast_var = sc.broadcast(var)
>   rdd.foreach(kernel)
>
>
> def main():
>#something
>
>
> In this code , I get the following error:
> NameError: global name 'broadcast_var ' is not defined
>
>
> Any ideas on how to fix it ?
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-variable-not-picked-up-tp26955.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Silly Question on my part...

2016-05-16 Thread Michael Segel
For one use case, we were considering using the Thrift server as a way to allow
multiple clients to access shared RDDs.

Within the Thrift context, we create an RDD and expose it as a Hive table.

The question is: where does the RDD exist? On the Thrift service node itself, or
is that just a reference to an RDD which is contained within contexts on the
cluster?


Thx

-Mike


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Matei Zaharia
I don't think any of the developers use this as an official channel, but all 
the ASF IRC channels are indeed on FreeNode. If there's demand for it, we can 
document this on the website and say that it's mostly for users to find other 
users. Development discussions should happen on the dev mailing list and JIRA 
so that they can easily be archived and found afterward.

Matei

> On May 16, 2016, at 1:06 PM, Dood@ODDO  wrote:
> 
> On 5/16/2016 9:52 AM, Xinh Huynh wrote:
>> I just went to IRC. It looks like the correct channel is #apache-spark.
>> So, is this an "official" chat room for Spark?
>> 
> 
> Ah yes, my apologies, it is #apache-spark indeed. Not sure if there is an 
> official channel on IRC for spark :-)
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kafka stream message sampling

2016-05-16 Thread Samuel Zhou
Hi, Mich,

I created the Kafka DStream with the following Java code:

sparkConf = new SparkConf().setAppName(this.getClass().getSimpleName()
    + ", topic: " + topics);

jssc = new JavaStreamingContext(sparkConf, Durations.seconds(batchInterval));

HashSet<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));

HashMap<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", brokers);

dStream = KafkaUtils.createDirectStream(jssc, byte[].class, byte[].class,
    DefaultDecoder.class, DefaultDecoder.class, kafkaParams, topicsSet);

Do you know if there is a way to do sampling for a stream when creating it?
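
For what it's worth, one way to thin the stream after it is created is
transform + sample; here is a PySpark sketch of the idea (the topic, broker and
10% rate are illustrative assumptions). Note that with the direct stream this
runs after the executors have already fetched the messages, so it cuts
downstream processing rather than the volume read from Kafka.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="sampling-example")        # assumption
ssc = StreamingContext(sc, 10)                       # 10-second batches, illustrative

kvs = KafkaUtils.createDirectStream(
    ssc, ["my_topic"], {"metadata.broker.list": "broker1:9092"})   # assumptions

# Keep roughly 10% of each batch.
sampled = kvs.transform(lambda rdd: rdd.sample(False, 0.1))
sampled.count().pprint()

# ssc.start(); ssc.awaitTermination()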

Thanks,

Samuel

On Mon, May 16, 2016 at 12:54 AM, Mich Talebzadeh  wrote:

> Hi Samuel,
>
> How do you create your RDD based on Kakfa direct stream?
>
> Do you have your code snippet?
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 May 2016 at 23:24, Samuel Zhou  wrote:
>
>> Hi,
>>
>> I was trying to use filter to sampling a Kafka direct stream, and the
>> filter function just take 1 messages from 10 by using hashcode % 10 == 0,
>> but the number of events of input for each batch didn't shrink to 10% of
>> original traffic. So I want to ask if there are any way to shrink the batch
>> size by a sampling function to save the traffic from Kafka?
>>
>> Thanks!
>> Samuel
>>
>
>


Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Cody Koeninger
Have you checked to make sure you can receive messages just using a
byte array for value?

On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
 wrote:
> I am trying to consume AVRO formatted message through
> KafkaUtils.createDirectStream. I followed the listed below example (refer
> link) but the messages are not being fetched by the Stream.
>
> http://stackoverflow.com/questions/30339636/spark-python-avro-kafka-deserialiser
>
> Is there any code missing that I must add to make the above sample work.
> Say, I am not sure how the confluent serializers would know the avro schema
> info as it knows only the Schema Registry URL info.
>
> Appreciate your help.
>
> ~Muthu
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Ramaswamy, Muthuraman
I am trying to consume AVRO-formatted messages through
KafkaUtils.createDirectStream. I followed the example listed below (see link),
but the messages are not being fetched by the stream.

http://stackoverflow.com/questions/30339636/spark-python-avro-kafka-deserialiser

Is there any code missing that I must add to make the above sample work? For
instance, I am not sure how the Confluent serializers would know the Avro schema
info, as they know only the Schema Registry URL.

Appreciate your help.

~Muthu
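
For reference, a rough sketch of a valueDecoder that understands Confluent's
Avro wire format (one magic byte, a 4-byte schema id, then the Avro payload)
and pulls the writer schema from the Schema Registry. The registry URL and the
use of the `requests` and `avro` packages are illustrative assumptions, not
part of the setup described above:

import io
import struct

import avro.io
import avro.schema   # Python 2 avro API; in Python 3 the call is avro.schema.Parse
import requests

SCHEMA_REGISTRY_URL = "http://localhost:8081"   # assumption: your registry URL
_schema_cache = {}

def _writer_schema(schema_id):
    # GET /schemas/ids/{id} returns {"schema": "<schema JSON as a string>"}
    if schema_id not in _schema_cache:
        resp = requests.get("%s/schemas/ids/%d" % (SCHEMA_REGISTRY_URL, schema_id))
        _schema_cache[schema_id] = avro.schema.parse(resp.json()["schema"])
    return _schema_cache[schema_id]

def confluent_avro_decoder(raw_bytes):
    if raw_bytes is None:
        return None
    # Confluent wire format: magic byte (0), 4-byte big-endian schema id, payload.
    magic, schema_id = struct.unpack(">bI", raw_bytes[:5])
    reader = avro.io.DatumReader(_writer_schema(schema_id))
    return reader.read(avro.io.BinaryDecoder(io.BytesIO(raw_bytes[5:])))

# kvs = KafkaUtils.createDirectStream(ssc, [topic],
#           {"metadata.broker.list": brokers},
#           valueDecoder=confluent_avro_decoder)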





how to add one more column in DataFrame

2016-05-16 Thread Zhiliang Zhu
Hi All,
For a given DataFrame created by Hive SQL, I need to add one more column based
on an existing column, while keeping all of the previous columns in the result
DataFrame.

final double DAYS_30 = 1000 * 60 * 60 * 24 * 30.0;
// DAYS_30 seems difficult to reference inside the SQL?
DataFrame behavior_df = jhql.sql("SELECT cast(user_id as double) as user_id, "
    + "cast(server_timestamp as double) as server_timestamp, "
    + "url, referer, source, app_version, params FROM log.request");

// this runs, but behavior_df.printSchema() is not changed at all
behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

// this runs, but behavior_df.printSchema() only has the one column daysLater30;
// the schema I want has all of the previous columns plus the added daysLater30
behavior_df = behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

Then, how should I do it?
Thank you, 



Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Dood

On 5/16/2016 9:53 AM, Yuval Itzchakov wrote:


AFAIK, the underlying data represented under the DataSet[T]
abstraction will be formatted in Tachyon under the hood, but as with
RDDs, if needed, they will be spilled to local disk on the worker.





There is another option in the case of RDDs: the Apache Ignite project, a
memory grid/distributed cache that supports Spark RDDs. The nice thing
about Ignite is that everything is done automatically for you; you can
also duplicate caches for resiliency, load caches from disk, partition
them, etc., and you get automatic spillover to SQL (and NoSQL) capable
backends via read/write-through capabilities. I think there is also an
effort to support DataFrames. Ignite supports standard SQL to query the
caches too.


On Mon, May 16, 2016, 19:47 Benjamin Kim > wrote:


I have a curiosity question. These forever/unlimited
DataFrames/DataSets will persist and be query capable. I still am
foggy about how this data will be stored. As far as I know, memory
is finite. Will the data be spilled to disk and be retrievable if
the query spans data not in memory? Is Tachyon (Alluxio), HDFS
(Parquet), NoSQL (HBase, Cassandra), RDBMS (PostgreSQL, MySQL),
Object Store (S3, Swift), or any else I can’t think of going to be
the underlying near real-time storage system?

Thanks,
Ben



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Dood

On 5/16/2016 9:52 AM, Xinh Huynh wrote:

I just went to IRC. It looks like the correct channel is #apache-spark.
So, is this an "official" chat room for Spark?



Ah yes, my apologies, it is #apache-spark indeed. Not sure if there is 
an official channel on IRC for spark :-)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
AFAIK, the underlying data represented under the DataSet[T] abstraction
will be formatted in Tachyon under the hood, but as with RDDs, if needed,
they will be spilled to local disk on the worker.

On Mon, May 16, 2016, 19:47 Benjamin Kim  wrote:

> I have a curiosity question. These forever/unlimited DataFrames/DataSets
> will persist and be query capable. I still am foggy about how this data
> will be stored. As far as I know, memory is finite. Will the data be
> spilled to disk and be retrievable if the query spans data not in memory?
> Is Tachyon (Alluxio), HDFS (Parquet), NoSQL (HBase, Cassandra), RDBMS
> (PostgreSQL, MySQL), Object Store (S3, Swift), or any else I can’t think of
> going to be the underlying near real-time storage system?
>
> Thanks,
> Ben
>
>
> On May 15, 2016, at 3:36 PM, Yuval Itzchakov  wrote:
>
> Hi Ofir,
> Thanks for the elaborated answer. I have read both documents, where they
> do a light touch on infinite Dataframes/Datasets. However, they do not go
> in depth as regards to how existing transformations on DStreams, for
> example, will be transformed into the Dataset APIs. I've been browsing the
> 2.0 branch and have yet been able to understand how they correlate.
>
> Also, placing SparkSession in the sql package seems like a peculiar
> choice, since this is going to be the global abstraction over
> SparkContext/StreamingContext from now on.
>
> On Sun, May 15, 2016, 23:42 Ofir Manor  wrote:
>
>> Hi Yuval,
>> let me share my understanding based on similar questions I had.
>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
>> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
>> (merging of Dataset and Dataframe - which is why it inherits all the
>> SparkSQL goodness), while RDD seems as a low-level API only for special
>> cases. The new Dataset should also support both batch and streaming -
>> replacing (eventually) DStream as well. See the design docs in SPARK-13485
>> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
>> However, as you noted, not all will be fully delivered in 2.0. For
>> example, it seems that streaming from / to Kafka using StructuredStreaming
>> didn't make it (so far?) to 2.0 (which is a showstopper for me).
>> Anyway, as far as I understand, you should be able to apply stateful
>> operators (non-RDD) on Datasets (for example, the new event-time window
>> processing SPARK-8360). The gap I see is mostly limited streaming sources /
>> sinks migrated to the new (richer) API and semantics.
>> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
>> examples will align with the current offering...
>>
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov 
>> wrote:
>>
>>> I've been reading/watching videos about the upcoming Spark 2.0 release
>>> which
>>> brings us Structured Streaming. One thing I've yet to understand is how
>>> this
>>> relates to the current state of working with Streaming in Spark with the
>>> DStream abstraction.
>>>
>>> All examples I can find, in the Spark repository/different videos is
>>> someone
>>> streaming local JSON files or reading from HDFS/S3/SQL. Also, when
>>> browsing
>>> the source, SparkSession seems to be defined inside
>>> org.apache.spark.sql, so
>>> this gives me a hunch that this is somehow all related to SQL and the
>>> likes,
>>> and not really to DStreams.
>>>
>>> What I'm failing to understand is: Will this feature impact how we do
>>> Streaming today? Will I be able to consume a Kafka source in a streaming
>>> fashion (like we do today when we open a stream using KafkaUtils)? Will
>>> we
>>> be able to do state-full operations on a Dataset[T] like we do today
>>> using
>>> MapWithStateRDD? Or will there be a subset of operations that the
>>> catalyst
>>> optimizer can understand such as aggregate and such?
>>>
>>> I'd be happy anyone could shed some light on this.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>>> .
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


Re: Apache Spark Slack

2016-05-16 Thread Xinh Huynh
I just went to IRC. It looks like the correct channel is #apache-spark.
So, is this an "official" chat room for Spark?

Xinh


On Mon, May 16, 2016 at 9:35 AM, Dood@ODDO  wrote:

> On 5/16/2016 9:30 AM, Paweł Szulc wrote:
>
>>
>> Just realized that people have to be invited to this thing. You see,
>> that's why Gitter is just simpler.
>>
>> I will try to figure it out ASAP
>>
>>
> You don't need invitations to IRC and it has been around for decades. You
> can just go to webchat.freenode.net and login into the #spark channel (or
> you can use CLI based clients). In addition, Gitter is owned by a private
> entity, it too requires an account and - what does it give you that is
> advantageous? You wanted real-time chat about Spark - IRC has it and the
> channel has already been around for a while :-)
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Benjamin Kim
I have a curiosity question. These forever/unlimited DataFrames/DataSets will
persist and be query capable. I am still foggy about how this data will be
stored. As far as I know, memory is finite. Will the data be spilled to disk
and be retrievable if the query spans data not in memory? Is Tachyon (Alluxio),
HDFS (Parquet), NoSQL (HBase, Cassandra), RDBMS (PostgreSQL, MySQL), Object
Store (S3, Swift), or anything else I can’t think of, going to be the underlying
near-real-time storage system?

Thanks,
Ben

> On May 15, 2016, at 3:36 PM, Yuval Itzchakov  wrote:
> 
> Hi Ofir,
> Thanks for the elaborated answer. I have read both documents, where they do a 
> light touch on infinite Dataframes/Datasets. However, they do not go in depth 
> as regards to how existing transformations on DStreams, for example, will be 
> transformed into the Dataset APIs. I've been browsing the 2.0 branch and have 
> yet been able to understand how they correlate.
> 
> Also, placing SparkSession in the sql package seems like a peculiar choice, 
> since this is going to be the global abstraction over 
> SparkContext/StreamingContext from now on.
> 
> On Sun, May 15, 2016, 23:42 Ofir Manor  > wrote:
> Hi Yuval,
> let me share my understanding based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two main 
> ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (merging 
> of Dataset and Dataframe - which is why it inherits all the SparkSQL 
> goodness), while RDD seems as a low-level API only for special cases. The new 
> Dataset should also support both batch and streaming - replacing (eventually) 
> DStream as well. See the design docs in SPARK-13485 (unified API) and 
> SPARK-8360 (StructuredStreaming) for a good intro. 
> However, as you noted, not all will be fully delivered in 2.0. For example, 
> it seems that streaming from / to Kafka using StructuredStreaming didn't make 
> it (so far?) to 2.0 (which is a showstopper for me). 
> Anyway, as far as I understand, you should be able to apply stateful 
> operators (non-RDD) on Datasets (for example, the new event-time window 
> processing SPARK-8360). The gap I see is mostly limited streaming sources / 
> sinks migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples 
> will align with the current offering...
> 
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov  > wrote:
> I've been reading/watching videos about the upcoming Spark 2.0 release which
> brings us Structured Streaming. One thing I've yet to understand is how this
> relates to the current state of working with Streaming in Spark with the
> DStream abstraction.
> 
> All examples I can find, in the Spark repository/different videos is someone
> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
> the source, SparkSession seems to be defined inside org.apache.spark.sql, so
> this gives me a hunch that this is somehow all related to SQL and the likes,
> and not really to DStreams.
> 
> What I'm failing to understand is: Will this feature impact how we do
> Streaming today? Will I be able to consume a Kafka source in a streaming
> fashion (like we do today when we open a stream using KafkaUtils)? Will we
> be able to do state-full operations on a Dataset[T] like we do today using
> MapWithStateRDD? Or will there be a subset of operations that the catalyst
> optimizer can understand such as aggregate and such?
> 
> I'd be happy anyone could shed some light on this.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 



Re: How to get and save core dump of native library in executors

2016-05-16 Thread prateek arora
Please help me solve my problem.

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-and-save-core-dump-of-native-library-in-executors-tp26945p26967.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Efficient for loops in Spark

2016-05-16 Thread Erik Erlandson

Regarding the specific problem of generating random folds in a more efficient 
way, this should help:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions

It uses a sort of multiplexing formalism on RDDs:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions

I wrote a blog post to explain the idea here:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
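
For comparison, the plain (non-multiplexed) way to set the folds up in PySpark
is randomSplit plus a union of the remaining folds, as in the sketch below. It
avoids duplicating the data, but each fold still costs a pass over the parent
RDD when it is used, which is the overhead the multiplexing approach above
removes. The local context and toy data are illustrative assumptions, and the
model-fitting calls are placeholders.

from functools import reduce

from pyspark import SparkContext

sc = SparkContext("local[*]", "cv-example")    # assumption: local context
data = sc.parallelize(range(1000))             # illustrative data

k = 5
folds = data.randomSplit([1.0] * k, seed=42)   # k disjoint random folds

for i in range(k):
    test = folds[i]
    train = reduce(lambda a, b: a.union(b),
                   [folds[j] for j in range(k) if j != i])
    # model = train_model(train)          # placeholder
    # score = evaluate(model, test)       # placeholder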



- Original Message -
> Hi there,
> 
> I'd like to write some iterative computation, i.e., computation that can be
> done via a for loop. I understand that in Spark foreach is a better choice.
> However, foreach and foreachPartition seem to be for self-contained
> computation that only involves the corresponding Row or Partition,
> respectively. But in my application each computational task does not only
> involve one partition, but also other partitions. It's just that every task
> has a specific way of using the corresponding partition and the other
> partitions. An example application will be cross-validation in machine
> learning, where each fold corresponds to a partition, e.g., the whole data
> is divided into 5 folds, then for task 1, I use fold 1 for testing and folds
> 2,3,4,5 for training; for task 2, I use fold 2 for testing and folds 1,3,4,5
> for training; etc.
> 
> In this case, if I were to use foreachPartition, it seems that I need to
> duplicate the data the number of times equal to the number of folds (or
> iterations of my for loop). More generally, I would need to still prepare a
> partition for every distributed task and that partition would need to
> include all the data needed for the task, which could be a huge waste of
> space.
> 
> Is there any other solutions? Thanks.
> 
> f.
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-for-loops-in-Spark-tp26939.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Dood

On 5/16/2016 9:30 AM, Paweł Szulc wrote:


Just realized that people have to be invited to this thing. You see,  
that's why Gitter is just simpler.


I will try to figure it out ASAP



You don't need invitations to IRC and it has been around for decades. 
You can just go to webchat.freenode.net and login into the #spark 
channel (or you can use CLI based clients). In addition, Gitter is owned 
by a private entity, it too requires an account and - what does it give 
you that is advantageous? You wanted real-time chat about Spark - IRC 
has it and the channel has already been around for a while :-)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Paweł Szulc
Just realized that people have to be invited to this thing. You see,
that's why Gitter is just simpler.

I will try to figure it out ASAP
On 16 May 2016 at 15:40, "Paweł Szulc"  wrote:

> I've just created this https://apache-spark.slack.com for ad-hoc
> communications within the comunity.
>
> Everybody's welcome!
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com
>


Re: Monitoring Spark application progress

2016-05-16 Thread Василец Дмитрий
spark + zabbix + jmx
https://translate.google.ru/translate?sl=ru&tl=en&u=https%3A%2F%2Fmkdev.me%2Fposts%2Fci-i-monitoring-spark-prilozheniy

On Mon, May 16, 2016 at 6:13 PM, Ashok Kumar 
wrote:

> Hi,
>
> I would like to know the approach and tools please to get the full
> performance for a Spark app running through Spark-shell and Spark-sumbit
>
>
>1. Through Spark GUI at 4040?
>2. Through OS utilities top, SAR
>3. Through Java tools like jbuilder etc
>4. Through integration Spark with monitoring tools.
>
>
> Thanks
>


Monitoring Spark application progress

2016-05-16 Thread Ashok Kumar
Hi,
I would like to know the approach and tools, please, to get the full performance
picture of a Spark app run through spark-shell and spark-submit:

   - Through the Spark GUI at 4040?
   - Through OS utilities such as top and SAR
   - Through Java tools like jbuilder, etc.
   - Through integrating Spark with monitoring tools.

Thanks

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
To understand the issue, you need to describe your case in more detail:
what version of Spark do you use, and what does your job do?
Also, what happens if you use the Scala interfaces directly instead of the Python ones?

On Mon, May 16, 2016 at 11:56 PM, Aleksandr Modestov <
aleksandrmodes...@gmail.com> wrote:

> Hi,
>
> "Why did you though you have enough memory for your task? You checked task
> statistics in your WebUI?". I mean that I have jnly about 5Gb data but
> spark.driver memory in 60Gb. I check task statistics in web UI.
> But really spark says that
> *"05-16 17:50:06.254 127.0.0.1:54321    1534
> #e Thread WARN: Swapping!  GC CALLBACK, (K/V:29.74 GB + POJO:18.97 GB +
> FREE:8.79 GB == MEM_MAX:57.50 GB), desiredKV=7.19 GB OOM!Exception in
> thread "Heartbeat" java.lang.OutOfMemoryError: Java heap space"*
> But why spark doesn't split data into a disk?
>
> On Mon, May 16, 2016 at 5:11 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> Why did you though you have enough memory for your task? You checked task
>> statistics in your WebUI?
>> Anyway, If you get stuck with the GC issue, you'd better off increasing
>> the number of partitions.
>>
>> // maropu
>>
>> On Mon, May 16, 2016 at 10:00 PM, AlexModestov <
>> aleksandrmodes...@gmail.com> wrote:
>>
>>> I get the error in the apache spark...
>>>
>>> "spark.driver.memory 60g
>>> spark.python.worker.memory 60g
>>> spark.master local[*]"
>>>
>>> The amount of data is about 5Gb, but spark says that "GC overhead limit
>>> exceeded". I guess that my conf-file gives enought resources.
>>>
>>> "16/05/16 15:13:02 WARN NettyRpcEndpointRef: Error sending message
>>> [message
>>> = Heartbeat(driver,[Lscala.Tuple2;@87576f9,BlockManagerId(driver,
>>> localhost,
>>> 59407))] in 1 attempts
>>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10
>>> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>>> at
>>> org.apache.spark.rpc.RpcTimeout.org
>>> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>>> at
>>>
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>>> at
>>>
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>> at
>>>
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>>> at
>>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>>> at
>>>
>>> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>>> at
>>> org.apache.spark.executor.Executor.org
>>> $apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
>>> at
>>>
>>> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
>>> at
>>>
>>> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
>>> at
>>>
>>> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
>>> at
>>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
>>> at
>>> org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
>>> at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at
>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>> at
>>>
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>> at
>>>
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>>> [10 seconds]
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>> at
>>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>> at
>>>
>>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>> at scala.concurrent.Await$.result(package.scala:107)
>>> at
>>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>>> ... 14 more
>>> 16/05/16 15:13:02 WARN NettyRpcEnv: Ignored message:
>>> HeartbeatResponse(false)
>>> 05-16 15:13:26.398 127.0.0.1:54321   2059   #e Thread WARN:
>>> Swapping!
>>> GC CALLBACK, (K/V:29.74 GB + POJO:16.74 GB + FREE:11.03 GB ==
>>> MEM_MAX:57.50
>>> GB), desiredKV=7.19 GB OOM!
>>> 05-16 15:13:44.528 127.0.0.1:54321   2059   #e Thread WARN:
>>> Swapping!
>>> GC CALLBACK, (K/V:29.74 GB + POJO:16.86 GB + FREE:10.90 

Re: GC overhead limit exceeded

2016-05-16 Thread Aleksandr Modestov
Hi,

"Why did you though you have enough memory for your task? You checked task
statistics in your WebUI?". I mean that I have jnly about 5Gb data but
spark.driver memory in 60Gb. I check task statistics in web UI.
But really spark says that
*"05-16 17:50:06.254 127.0.0.1:54321    1534
#e Thread WARN: Swapping!  GC CALLBACK, (K/V:29.74 GB + POJO:18.97 GB +
FREE:8.79 GB == MEM_MAX:57.50 GB), desiredKV=7.19 GB OOM!Exception in
thread "Heartbeat" java.lang.OutOfMemoryError: Java heap space"*
But why spark doesn't split data into a disk?

On Mon, May 16, 2016 at 5:11 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Why did you though you have enough memory for your task? You checked task
> statistics in your WebUI?
> Anyway, If you get stuck with the GC issue, you'd better off increasing
> the number of partitions.
>
> // maropu
>
> On Mon, May 16, 2016 at 10:00 PM, AlexModestov <
> aleksandrmodes...@gmail.com> wrote:
>
>> I get the error in the apache spark...
>>
>> "spark.driver.memory 60g
>> spark.python.worker.memory 60g
>> spark.master local[*]"
>>
>> The amount of data is about 5Gb, but spark says that "GC overhead limit
>> exceeded". I guess that my conf-file gives enought resources.
>>
>> "16/05/16 15:13:02 WARN NettyRpcEndpointRef: Error sending message
>> [message
>> = Heartbeat(driver,[Lscala.Tuple2;@87576f9,BlockManagerId(driver,
>> localhost,
>> 59407))] in 1 attempts
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10
>> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>> at
>> org.apache.spark.rpc.RpcTimeout.org
>> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>> at
>>
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>> at
>>
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>> at
>>
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>> at
>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>> at
>> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>> at
>> org.apache.spark.executor.Executor.org
>> $apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
>> at
>>
>> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
>> at
>>
>> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
>> at
>>
>> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
>> at
>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
>> at
>> org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
>> at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at
>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>> at
>>
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>> at
>>
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [10 seconds]
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at
>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>>
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at
>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>> ... 14 more
>> 16/05/16 15:13:02 WARN NettyRpcEnv: Ignored message:
>> HeartbeatResponse(false)
>> 05-16 15:13:26.398 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
>> GC CALLBACK, (K/V:29.74 GB + POJO:16.74 GB + FREE:11.03 GB ==
>> MEM_MAX:57.50
>> GB), desiredKV=7.19 GB OOM!
>> 05-16 15:13:44.528 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
>> GC CALLBACK, (K/V:29.74 GB + POJO:16.86 GB + FREE:10.90 GB ==
>> MEM_MAX:57.50
>> GB), desiredKV=7.19 GB OOM!
>> 05-16 15:13:56.847 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
>> GC CALLBACK, (K/V:29.74 GB + POJO:16.88 GB + FREE:10.88 GB ==
>> MEM_MAX:57.50
>> GB), desiredKV=7.19 GB OOM!
>> 05-16 15:14:10.215 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
>> GC CALLBACK, (K/V:29.74 GB + POJO:16.90 GB + FREE:10.86 GB ==
>> MEM_MAX:57.50
>> GB), desiredKV=7.19 GB 

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
Hi,

Why did you think you have enough memory for your task? Did you check the task
statistics in your WebUI?
Anyway, if you get stuck with the GC issue, you'd be better off increasing the
number of partitions.
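For example (just a sketch, assuming a DataFrame loaded through sqlContext; the
input path and the 200-partition count are only illustrative values to tune
against your own data and cluster):

    import org.apache.spark.storage.StorageLevel

    // Spread the data over more, smaller partitions so each task holds less
    // data in memory at a time (the path is hypothetical).
    val df = sqlContext.read.json("data.json").repartition(200)

    // If the data is cached, MEMORY_AND_DISK lets blocks that do not fit in
    // memory spill to local disk instead of piling up GC pressure.
    df.persist(StorageLevel.MEMORY_AND_DISK)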

// maropu

On Mon, May 16, 2016 at 10:00 PM, AlexModestov 
wrote:

> I get the error in the apache spark...
>
> "spark.driver.memory 60g
> spark.python.worker.memory 60g
> spark.master local[*]"
>
> The amount of data is about 5Gb, but spark says that "GC overhead limit
> exceeded". I guess that my conf-file gives enought resources.
>
> "16/05/16 15:13:02 WARN NettyRpcEndpointRef: Error sending message [message
> = Heartbeat(driver,[Lscala.Tuple2;@87576f9,BlockManagerId(driver,
> localhost,
> 59407))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
> at
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at
>
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
> at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
> at
> org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
> at
> org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [10 seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
>
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> ... 14 more
> 16/05/16 15:13:02 WARN NettyRpcEnv: Ignored message:
> HeartbeatResponse(false)
> 05-16 15:13:26.398 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.74 GB + FREE:11.03 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:13:44.528 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.86 GB + FREE:10.90 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:13:56.847 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.88 GB + FREE:10.88 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:14:10.215 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.90 GB + FREE:10.86 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:14:33.622 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.91 GB + FREE:10.85 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:14:47.075 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:15:10.555 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.92 GB + FREE:10.84 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:15:25.520 127.0.0.1:54321   2059   #e Thread WARN: Swapping!
> GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
> GB), desiredKV=7.19 GB OOM!
> 05-16 15:15:39.087 127.0.0.1:54321   

Re: Apache Spark Slack

2016-05-16 Thread Dood

On 5/16/2016 6:40 AM, Paweł Szulc wrote:
I've just created this https://apache-spark.slack.com for ad-hoc 
communications within the community.


Everybody's welcome!


Why not just IRC? Slack is yet another place to create an account etc. - 
IRC is much easier. What does Slack give you that's so very special? :-)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: apache spark on gitter?

2016-05-16 Thread Sean Owen
We'll have to be careful to not present this as an official project
chat, since it's not affiliated. Even calling it
apache-spark.slack.com is potentially problematic since it gives some
impression it's from the ASF.

Ideally, call this something that is unambiguously not related to
Apache itself, even as it is possible to make it clear it does refer
to Apache Spark.

Since this has come up in recent memory, I have a good link handy for
the interested:

http://www.apache.org/foundation/marks/

On Mon, May 16, 2016 at 2:41 PM, Paweł Szulc  wrote:
> I've just created https://apache-spark.slack.com
>
> On Thu, May 12, 2016 at 9:28 AM, Paweł Szulc  wrote:
>>
>> Hi,
>>
>> well I guess the advantage of gitter over a mailing list is the same as with
>> IRC. It's not actually a replacement, because the mailing list is also important.
>> But it is a lot easier to build a community around a tool with the ad-hoc ability
>> to connect with each other.
>>
>> I have gitter running constantly; I visit my favorite OSS projects on
>> it from time to time to read what has recently happened. It allows me to
>> stay in touch with the project and help fellow developers with problems they
>> have.
>> One might argue that you can achieve the same with a mailing list; well, it's
>> hard for me to put this into words. A mailing list is more of an async nature
>> (which is good!), but sometimes you need a more "real-time" experience. You
>> know, engaging in the conversation in the given moment, not a conversation that
>> might last a few days :)
>>
>> TL;DR: It is not a replacement, it's a supplement to build the community
>> around OSS. Worth having for real-time conversations.
>>
>> On Wed, May 11, 2016 at 10:24 PM, Xinh Huynh  wrote:
>>>
>>> Hi Pawel,
>>>
>>> I'd like to hear more about your idea. Could you explain more why you
>>> would like to have a gitter channel? What are the advantages over a mailing
>>> list (like this one)? Have you had good experiences using gitter on other
>>> open source projects?
>>>
>>> Xinh
>>>
>>> On Wed, May 11, 2016 at 11:10 AM, Sean Owen  wrote:

 I don't know of a gitter channel and I don't use it myself, FWIW. I
 think anyone's welcome to start one.

 I hesitate to recommend this, simply because it's preferable to have
 one place for discussion rather than split it over several, and, we
 have to keep the @spark.apache.org mailing lists as the "forums of
 records" for project discussions.

 If something like gitter doesn't attract any chat, then it doesn't add
 any value. If it does though, then suddenly someone needs to subscribe
 to user@ and gitter to follow all of the conversations.

 I think there is a bit of a scalability problem on the user@ list at
 the moment, just because it covers all of Spark. But adding a
 different all-Spark channel doesn't help that.

 Anyway maybe that's "why"


 On Wed, May 11, 2016 at 6:26 PM, Paweł Szulc 
 wrote:
 > no answer, but maybe one more time, a gitter channel for spark users
 > would
 > be a good idea!
 >
 > On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc 
 > wrote:
 >>
 >> Hi,
 >>
 >> I was wondering - why Spark does not have a gitter channel?
 >>
 >> --
 >> Regards,
 >> Paul Szulc
 >>
 >> twitter: @rabbitonweb
 >> blog: www.rabbitonweb.com
 >
 >
 >
 >
 > --
 > Regards,
 > Paul Szulc
 >
 > twitter: @rabbitonweb
 > blog: www.rabbitonweb.com

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

>>>
>>
>>
>>
>> --
>> Regards,
>> Paul Szulc
>>
>> twitter: @rabbitonweb
>> blog: www.rabbitonweb.com
>
>
>
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Apache Spark Slack

2016-05-16 Thread Paweł Szulc
I've just created this https://apache-spark.slack.com for ad-hoc
communications within the community.

Everybody's welcome!

-- 
Regards,
Paul Szulc

twitter: @rabbitonweb
blog: www.rabbitonweb.com


Re: apache spark on gitter?

2016-05-16 Thread Paweł Szulc
I've just created https://apache-spark.slack.com

On Thu, May 12, 2016 at 9:28 AM, Paweł Szulc  wrote:

> Hi,
>
> well I guess the advantage of gitter over a mailing list is the same as with
> IRC. It's not actually a replacement, because the mailing list is also important.
> But it is a lot easier to build a community around a tool with the ad-hoc ability
> to connect with each other.
>
> I have gitter running constantly; I visit my favorite OSS projects on
> it from time to time to read what has recently happened. It allows me to
> stay in touch with the project and help fellow developers with problems
> they have.
> One might argue that you can achieve the same with a mailing list; well, it's
> hard for me to put this into words. A mailing list is more of an async
> nature (which is good!), but sometimes you need a more "real-time"
> experience. You know, engaging in the conversation in the given moment, not
> a conversation that might last a few days :)
>
> TL;DR: It is not a replacement, it's a supplement to build the community
> around OSS. Worth having for real-time conversations.
>
> On Wed, May 11, 2016 at 10:24 PM, Xinh Huynh  wrote:
>
>> Hi Pawel,
>>
>> I'd like to hear more about your idea. Could you explain more why you
>> would like to have a gitter channel? What are the advantages over a mailing
>> list (like this one)? Have you had good experiences using gitter on other
>> open source projects?
>>
>> Xinh
>>
>> On Wed, May 11, 2016 at 11:10 AM, Sean Owen  wrote:
>>
>>> I don't know of a gitter channel and I don't use it myself, FWIW. I
>>> think anyone's welcome to start one.
>>>
>>> I hesitate to recommend this, simply because it's preferable to have
>>> one place for discussion rather than split it over several, and, we
>>> have to keep the @spark.apache.org mailing lists as the "forums of
>>> records" for project discussions.
>>>
>>> If something like gitter doesn't attract any chat, then it doesn't add
>>> any value. If it does though, then suddenly someone needs to subscribe
>>> to user@ and gitter to follow all of the conversations.
>>>
>>> I think there is a bit of a scalability problem on the user@ list at
>>> the moment, just because it covers all of Spark. But adding a
>>> different all-Spark channel doesn't help that.
>>>
>>> Anyway maybe that's "why"
>>>
>>>
>>> On Wed, May 11, 2016 at 6:26 PM, Paweł Szulc 
>>> wrote:
>>> > no answer, but maybe one more time, a gitter channel for spark users
>>> would
>>> > be a good idea!
>>> >
>>> > On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc 
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I was wondering - why Spark does not have a gitter channel?
>>> >>
>>> >> --
>>> >> Regards,
>>> >> Paul Szulc
>>> >>
>>> >> twitter: @rabbitonweb
>>> >> blog: www.rabbitonweb.com
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Paul Szulc
>>> >
>>> > twitter: @rabbitonweb
>>> > blog: www.rabbitonweb.com
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com
>



-- 
Regards,
Paul Szulc

twitter: @rabbitonweb
blog: www.rabbitonweb.com


GC overhead limit exceeded

2016-05-16 Thread AlexModestov
I get the error in the apache spark...

"spark.driver.memory 60g
spark.python.worker.memory 60g
spark.master local[*]"

The amount of data is about 5Gb, but spark says that "GC overhead limit
exceeded". I guess that my conf-file gives enought resources.

"16/05/16 15:13:02 WARN NettyRpcEndpointRef: Error sending message [message
= Heartbeat(driver,[Lscala.Tuple2;@87576f9,BlockManagerId(driver, localhost,
59407))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10
seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
at
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
at
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
at
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[10 seconds]
at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
16/05/16 15:13:02 WARN NettyRpcEnv: Ignored message:
HeartbeatResponse(false)
05-16 15:13:26.398 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.74 GB + FREE:11.03 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:13:44.528 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.86 GB + FREE:10.90 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:13:56.847 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.88 GB + FREE:10.88 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:14:10.215 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.90 GB + FREE:10.86 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:14:33.622 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.91 GB + FREE:10.85 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:14:47.075 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:15:10.555 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.92 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:15:25.520 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:15:39.087 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
Exception in thread "HashSessionScavenger-0" java.lang.OutOfMemoryError: GC
overhead limit exceeded
at
java.util.concurrent.ConcurrentHashMap$ValuesView.iterator(ConcurrentHashMap.java:4683)
at
org.eclipse.jetty.server.session.HashSessionManager.scavenge(HashSessionManager.java:314)
at

Re: Debug spark core and streaming programs in scala

2016-05-16 Thread Ted Yu
From https://spark.apache.org/docs/latest/monitoring.html#metrics :

   - JmxSink: Registers metrics for viewing in a JMX console.

FYI
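
As a rough sketch (assuming the stock metrics system described in those docs),
enabling the sink is a one-line entry in conf/metrics.properties, which ships
as metrics.properties.template:

    # conf/metrics.properties - enable the JMX sink for all instances
    # (driver, executors, master, worker)
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

Once that file is picked up (or pointed to via spark.metrics.conf), the driver
and executors expose their metrics over JMX, so you can attach JConsole or any
other JMX client to the SparkSubmit process.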

On Sun, May 15, 2016 at 11:54 PM, Mich Talebzadeh  wrote:

> Have you tried the Spark GUI on 4040? This will show jobs being executed by
> executors in each stage, and the line of code as well.
>
> [image: Inline images 1]
>
> Also command line tools like jps and jmonitor
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 May 2016 at 06:25, Deepak Sharma  wrote:
>
>> Hi
>> I have a scala program consisting of spark core and spark streaming APIs.
>> Is there any open source tool that I can use to debug the program for
>> performance reasons?
>> My primary interest is to find the blocks of code that would be executed
>> on the driver and what would go to the executors.
>> Is there a JMX extension of Spark?
>>
>> --
>> Thanks
>> Deepak
>>
>>
>


Re: Lost names of struct fields in UDF

2016-05-16 Thread Alexander Chermenin
Hi. It's surprising, but this code solves my problem:

private static Column namedStruct(Column... cols) {
    List<Expression> exprs = Arrays.stream(cols)
        .flatMap(c ->
            Stream.of(
                new Literal(UTF8String.fromString(((NamedExpression) c.expr()).name()), DataTypes.StringType),
                c.expr()
            )
        )
        .collect(Collectors.toList());
    return new Column(new CreateNamedStruct(JavaConversions.asScalaBuffer(exprs).toSeq()));
}

...

DataFrame profiles = df.select(
        column("_id"),
        namedStruct(
                column("name.first").as("first_name"),
                column("name.last").as("last_name"),
                column("friends")
        ).as("profile"))

...

Didn't go deep and wasn't looking for any reasons of the problem.

Best regards, Alexander Chermenin.
Web: http://chermenin.ru
Mail: a...@chermenin.ru

06.05.2016, 14:19, "Alexander Chermenin" :

Hi everybody!

This code:

DataFrame df = sqlContext.read().json(FILE_NAME);

DataFrame profiles = df.select(
        column("_id"),
        struct(
                column("name.first").as("first_name"),
                column("name.last").as("last_name"),
                column("friends")
        ).as("profile")).limit(1);

profiles.select(column("_id"), column("profile")).toJavaRDD().collect().forEach(r -> printRowFields(r.getStruct(1))); // #1

sqlContext.udf().register("schema", (UDF1) r -> printRowFields(r), DataTypes.NullType); // #2
profiles.select(column("_id"), callUDF("schema", column("profile"))).show();

out:

#1:
StructField(first_name,StringType,true)
StructField(last_name,StringType,true)
StructField(friends,ArrayType(StructType(StructField(id,LongType,true), StructField(name,StringType,true)),true),true)

#2:
StructField(col1,StringType,true)
StructField(col2,StringType,true)
StructField(i[2],ArrayType(StructType(StructField(id,LongType,true), StructField(name,StringType,true)),true),true)

But why are the names of the fields lost in the UDF? What's wrong?

Best regards, Alex Chermenin.

Re: Renaming nested columns in dataframe

2016-05-16 Thread Alexander Chermenin
Hello. I think you can use something like this:

df.select(
    struct(
        column("site.site_id").as("id"),
        column("site.site_name").as("name"),
        column("site.site_domain").as("domain"),
        column("site.site_cat").as("cat"),
        struct(
            column("site.publisher.site_pub_id").as("id"),
            column("site.publisher.site_pub_name").as("name")
        ).as("publisher")
    ).as("site"))

Best regards, Alexander Chermenin.
Web: http://chermenin.ru
Mail: a...@chermenin.ru

16.05.2016, 14:58, "Prashant Bhardwaj" :

Hi

How can I rename nested columns in a dataframe through the scala API? Like the following schema:

|-- site: struct (nullable = false)
|    |-- site_id: string (nullable = true)
|    |-- site_name: string (nullable = true)
|    |-- site_domain: string (nullable = true)
|    |-- site_cat: array (nullable = true)
|    |    |-- element: string (containsNull = false)
|    |-- publisher: struct (nullable = false)
|    |    |-- site_pub_id: string (nullable = true)
|    |    |-- site_pub_name: string (nullable = true)

I want to change it to:

|-- site: struct (nullable = false)
|    |-- id: string (nullable = true)
|    |-- name: string (nullable = true)
|    |-- domain: string (nullable = true)
|    |-- cat: array (nullable = true)
|    |    |-- element: string (containsNull = false)
|    |-- publisher: struct (nullable = false)
|    |    |-- id: string (nullable = true)
|    |    |-- name: string (nullable = true)

Regards
Prashant

What / Where / When / How questions in Spark 2.0 ?

2016-05-16 Thread Ovidiu-Cristian MARCU
Hi,

We can see in [2] many interesting (and expected!) improvements (promises) like
extended SQL support, a unified API (DataFrames, Datasets), an improved engine
(Tungsten relates to ideas from modern compilers and MPP databases - similar to
Flink [3]), structured streaming, etc. It seems we are somehow witnessing a smart
unification of Big Data analytics (Spark and Flink - the best of two worlds)!

How does Spark respond to the missing What/Where/When/How questions
(capabilities) highlighted in the unified Beam model [1]?

Best,
Ovidiu

[1] 
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
 

[2] 
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
 

[3] http://stratosphere.eu/project/publications/ 





Renaming nested columns in dataframe

2016-05-16 Thread Prashant Bhardwaj
Hi

How can I rename nested columns in a dataframe through the scala API? Like the
following schema:

> |-- site: struct (nullable = false)
> |    |-- site_id: string (nullable = true)
> |    |-- site_name: string (nullable = true)
> |    |-- site_domain: string (nullable = true)
> |    |-- site_cat: array (nullable = true)
> |    |    |-- element: string (containsNull = false)
> |    |-- publisher: struct (nullable = false)
> |    |    |-- site_pub_id: string (nullable = true)
> |    |    |-- site_pub_name: string (nullable = true)

I want to change it to:

> |-- site: struct (nullable = false)
> |    |-- id: string (nullable = true)
> |    |-- name: string (nullable = true)
> |    |-- domain: string (nullable = true)
> |    |-- cat: array (nullable = true)
> |    |    |-- element: string (containsNull = false)
> |    |-- publisher: struct (nullable = false)
> |    |    |-- id: string (nullable = true)
> |    |    |-- name: string (nullable = true)

Regards
Prashant


Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
Oh, that looks neat! Thx, will read up on that.

On Mon, May 16, 2016, 14:10 Ofir Manor  wrote:

> Yuval,
> Not sure what in-scope to land in 2.0, but there is another new infra bit
> to manage state more efficiently called State Store, whose initial version
> is already committed:
>SPARK-13809 - State Store: A new framework for state management for
> computing Streaming Aggregates
> https://issues.apache.org/jira/browse/SPARK-13809
> Eventually the pull request links into the design doc, that discusses the
> limits of updateStateByKey and mapWithState and how that will be
> handled...
>
> At a quick glance at the code, it seems to be used already in streaming
> aggregations.
>
> Just my two cents,
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Mon, May 16, 2016 at 11:33 AM, Yuval Itzchakov 
> wrote:
>
>> Also, re-reading the relevant part from the Structured Streaming
>> documentation (
>> https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x
>> ):
>> Discretized streams (aka dstream)
>>
>> Unlike Storm, dstream exposes a higher level API similar to RDDs. There
>> are two main challenges with dstream:
>>
>>
>>1.
>>
>>Similar to Storm, it exposes a monotonic system (processing) time
>>metric, and makes support for event time difficult.
>>2.
>>
>>Its APIs are tied to the underlying microbatch execution model, and
>>as a result lead to inflexibilities such as changing the underlying batch
>>interval would require changing the window size.
>>
>>
>> RQ addresses the above:
>>
>>
>>1.
>>
>>RQ operations support both system time and event time.
>>2.
>>
>>RQ APIs are decoupled from the underlying execution model. As a
>>matter of fact, it is possible to implement an alternative engine that is
>>not microbatch-based for RQ.
>>3. In addition, due to the declarative specification of operations,
>>RQ leverages a relational query optimizer and can often generate more
>>efficient query plans.
>>
>>
>> This doesn't seem to attack the actual underlying implementation for how
>> things like "mapWithState" are going to be translated into RQ, and I think
>> that's the hole that's causing my misunderstanding.
>>
>> On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov 
>> wrote:
>>
>>> Hi Ofir,
>>> Thanks for the elaborated answer. I have read both documents, where they
>>> do a light touch on infinite Dataframes/Datasets. However, they do not go
>>> in depth as regards to how existing transformations on DStreams, for
>>> example, will be transformed into the Dataset APIs. I've been browsing the
>>> 2.0 branch and have not yet been able to understand how they correlate.
>>>
>>> Also, placing SparkSession in the sql package seems like a peculiar
>>> choice, since this is going to be the global abstraction over
>>> SparkContext/StreamingContext from now on.
>>>
>>> On Sun, May 15, 2016, 23:42 Ofir Manor  wrote:
>>>
 Hi Yuval,
 let me share my understanding based on similar questions I had.
 First, Spark 2.x aims to replace a whole bunch of its APIs with just
 two main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
 (merging of Dataset and Dataframe - which is why it inherits all the
 SparkSQL goodness), while RDD seems as a low-level API only for special
 cases. The new Dataset should also support both batch and streaming -
 replacing (eventually) DStream as well. See the design docs in SPARK-13485
 (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
 However, as you noted, not all will be fully delivered in 2.0. For
 example, it seems that streaming from / to Kafka using StructuredStreaming
 didn't make it (so far?) to 2.0 (which is a showstopper for me).
 Anyway, as far as I understand, you should be able to apply stateful
 operators (non-RDD) on Datasets (for example, the new event-time window
 processing SPARK-8360). The gap I see is mostly limited streaming sources /
 sinks migrated to the new (richer) API and semantics.
 Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
 examples will align with the current offering...


 Ofir Manor

 Co-Founder & CTO | Equalum

 Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

 On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov 
 wrote:

> I've been reading/watching videos about the upcoming Spark 2.0 release
> which
> brings us Structured Streaming. One thing I've yet to understand is
> how this
> relates to the current state of working with Streaming in Spark with
> the
> DStream abstraction.
>
> All examples I can find, in the Spark repository/different videos is
> someone
> streaming local JSON 

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Ofir Manor
Yuval,
Not sure what in-scope to land in 2.0, but there is another new infra bit
to manage state more efficiently called State Store, whose initial version
is already committed:
   SPARK-13809 - State Store: A new framework for state management for
computing Streaming Aggregates
https://issues.apache.org/jira/browse/SPARK-13809
Eventually the pull request links into the design doc, that discusses the
limits of updateStateByKey and mapWithState and how that will be
handled...

At a quick glance at the code, it seems to be used already in streaming
aggregations.

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, May 16, 2016 at 11:33 AM, Yuval Itzchakov  wrote:

> Also, re-reading the relevant part from the Structured Streaming
> documentation (
> https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x
> ):
> Discretized streams (aka dstream)
>
> Unlike Storm, dstream exposes a higher level API similar to RDDs. There
> are two main challenges with dstream:
>
>
>1.
>
>Similar to Storm, it exposes a monotonic system (processing) time
>metric, and makes support for event time difficult.
>2.
>
>Its APIs are tied to the underlying microbatch execution model, and as
>a result lead to inflexibilities such as changing the underlying batch
>interval would require changing the window size.
>
>
> RQ addresses the above:
>
>
>1.
>
>RQ operations support both system time and event time.
>2.
>
>RQ APIs are decoupled from the underlying execution model. As a matter
>of fact, it is possible to implement an alternative engine that is not
>microbatch-based for RQ.
>3. In addition, due to the declarative specification of operations, RQ
>leverages a relational query optimizer and can often generate more
>efficient query plans.
>
>
> This doesn't seem to attack the actual underlying implementation for how
> things like "mapWithState" are going to be translated into RQ, and I think
> that's the hole that's causing my misunderstanding.
>
> On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov  wrote:
>
>> Hi Ofir,
>> Thanks for the elaborated answer. I have read both documents, where they
>> do a light touch on infinite Dataframes/Datasets. However, they do not go
>> in depth as regards to how existing transformations on DStreams, for
>> example, will be transformed into the Dataset APIs. I've been browsing the
>> 2.0 branch and have not yet been able to understand how they correlate.
>>
>> Also, placing SparkSession in the sql package seems like a peculiar
>> choice, since this is going to be the global abstraction over
>> SparkContext/StreamingContext from now on.
>>
>> On Sun, May 15, 2016, 23:42 Ofir Manor  wrote:
>>
>>> Hi Yuval,
>>> let me share my understanding based on similar questions I had.
>>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
>>> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
>>> (merging of Dataset and Dataframe - which is why it inherits all the
>>> SparkSQL goodness), while RDD seems as a low-level API only for special
>>> cases. The new Dataset should also support both batch and streaming -
>>> replacing (eventually) DStream as well. See the design docs in SPARK-13485
>>> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
>>> However, as you noted, not all will be fully delivered in 2.0. For
>>> example, it seems that streaming from / to Kafka using StructuredStreaming
>>> didn't make it (so far?) to 2.0 (which is a showstopper for me).
>>> Anyway, as far as I understand, you should be able to apply stateful
>>> operators (non-RDD) on Datasets (for example, the new event-time window
>>> processing SPARK-8360). The gap I see is mostly limited streaming sources /
>>> sinks migrated to the new (richer) API and semantics.
>>> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
>>> examples will align with the current offering...
>>>
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov 
>>> wrote:
>>>
 I've been reading/watching videos about the upcoming Spark 2.0 release
 which
 brings us Structured Streaming. One thing I've yet to understand is how
 this
 relates to the current state of working with Streaming in Spark with the
 DStream abstraction.

 All examples I can find, in the Spark repository/different videos is
 someone
 streaming local JSON files or reading from HDFS/S3/SQL. Also, when
 browsing
 the source, SparkSession seems to be defined inside
 org.apache.spark.sql, so
 this gives me a hunch that this is somehow all related to SQL and the
 likes,
 and not really to DStreams.

Re: spark udf can not change a json string to a map

2016-05-16 Thread ??????
Hi, Ted.
I found a built-in function called str_to_map, which can transform a string to a
map.
But it cannot meet my need.


Because my string may be a map with an array nested in its value,
for example, map<string, array<string>>.
I think it will not work in my situation.


Cheers



------------------ Original Message ------------------
From: "??" <251922...@qq.com>
Date: May 16, 2016 (Mon) 10:00
To: "Ted Yu";
Cc: "user";
Subject: Re: spark udf can not change a json string to a map



This is my use case:
   Another system uploads csv files to my system. In the csv files, there are
complicated data types such as map. In order to express complicated data types
and ordinary strings that contain special characters, we put urlencoded strings
in the csv files. So we use urlencoded json strings to express map, string and array.


Second stage:
  Load the csv files into a spark text table.
###
CREATE TABLE `a_text`(
  parameters  string
)
load data inpath 'XXX' into table a_text;
#
Third stage:
  Insert into a spark parquet table, selecting from the text table. In order to
take advantage of complicated data types, we use a udf to transform the json
string to a map, and put the map into the table.


CREATE TABLE `a_parquet`(
  parameters   map
)



insert into a_parquet select UDF(parameters ) from a_text;


So do you have any suggestions?












------------------ Original Message ------------------
From: "Ted Yu";
Date: May 16, 2016 (Mon) 0:44
To: "??" <251922...@qq.com>;
Cc: "user";
Subject: Re: spark udf can not change a json string to a map



Can you let us know more about your use case ?

I wonder if you can structure your udf by not returning Map.


Cheers


On Sun, May 15, 2016 at 9:18 AM, ?? <251922...@qq.com> wrote:
Hi, all. I want to implement a udf which is used to change a json string to a
map.
But some problems occur. My spark version: 1.5.1.




my udf code:

public Map evaluate(final String s) {
    if (s == null)
        return null;
    return getString(s);
}

@SuppressWarnings("unchecked")
public static Map getString(String s) {
    try {
        String str = URLDecoder.decode(s, "UTF-8");
        ObjectMapper mapper = new ObjectMapper();
        Map map = mapper.readValue(str, Map.class);
        return map;
    } catch (Exception e) {
        return new HashMap();
    }
}

#
exception infos:


16/05/14 21:05:22 ERROR CliDriver: org.apache.spark.sql.AnalysisException: Map 
type in java is unsupported because JVM type erasure makes spark fail to catch 
key and value types in Map<>; line 1 pos 352
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:230)
at 
org.apache.spark.sql.hive.HiveSimpleUDF.javaClassToDataType(hiveUDFs.scala:107)
at org.apache.spark.sql.hive.HiveSimpleUDF.(hiveUDFs.scala:136)






I have seen that there is a test suite in spark which says spark does not support this
kind of udf.
But is there a method to implement this udf?
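
One possible direction (just a sketch, not verified against 1.5.1; the function
name json_to_map and the use of Spark's bundled Jackson are assumptions):
registering the UDF through the Scala API lets the compiler capture the key and
value types of the returned Map, so the type-erasure check in the exception
above is not hit:

    import java.net.URLDecoder
    import scala.collection.JavaConverters._
    import com.fasterxml.jackson.databind.ObjectMapper

    sqlContext.udf.register("json_to_map", (s: String) => {
      if (s == null) null
      else {
        // Decode the urlencoded payload and parse it as a flat string-to-string map.
        val decoded = URLDecoder.decode(s, "UTF-8")
        new ObjectMapper()
          .readValue(decoded, classOf[java.util.Map[String, String]])
          .asScala
          .toMap
      }
    })

The registered function could then be called from SQL, e.g.
insert into a_parquet select json_to_map(parameters) from a_text. Values that
are themselves arrays would need a richer return type than Map[String, String].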

Issue with creation of EC2 cluster using spark scripts

2016-05-16 Thread Marco Mistroni
Hi all,
I am experiencing issues when creating EC2 clusters using the scripts in the
spark/ec2 directory.

I launched the following command:

./spark-ec2 -k sparkkey -i sparkAccessKey.pem -r us-west2 -s 4 launch
MM-Cluster


My output is stuck with the following (has been for the last 20 minutes)



I am running Spark on Windows 10, and I have an AWS account in the us-west-2 region.

Warning: SSH connection error. (This could be temporary.)
Host: 172.31.42.148
SSH return code: 255
SSH output: ssh: connect to host 172.31.42.148 port 22: Connection timed out

.

Warning: SSH connection error. (This could be temporary.)
Host: 172.31.42.148
SSH return code: 255
SSH output: ssh: connect to host 172.31.42.148 port 22: Connection timed out

.

Warning: SSH connection error. (This could be temporary.)
Host: 172.31.42.148
SSH return code: 255
SSH output: ssh: connect to host 172.31.42.148 port 22: Connection timed out

.

Warning: SSH connection error. (This could be temporary.)
Host: 172.31.42.148
SSH return code: 255
SSH output: ssh: connect to host 172.31.42.148 port 22: Connection timed out


From the AWS console I can see that the master and slaves are in running status.

Could anyone assist?
Kindest regards,
 Marco


Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
Also, re-reading the relevant part from the Structured Streaming
documentation (
https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x
):
Discretized streams (aka dstream)

Unlike Storm, dstream exposes a higher level API similar to RDDs. There are
two main challenges with dstream:


   1.

   Similar to Storm, it exposes a monotonic system (processing) time
   metric, and makes support for event time difficult.
   2.

   Its APIs are tied to the underlying microbatch execution model, and as a
   result lead to inflexibilities such as changing the underlying batch
   interval would require changing the window size.


RQ addresses the above:


   1.

   RQ operations support both system time and event time.
   2.

   RQ APIs are decoupled from the underlying execution model. As a matter
   of fact, it is possible to implement an alternative engine that is not
   microbatch-based for RQ.
   3. In addition, due to the declarative specification of operations, RQ
   leverages a relational query optimizer and can often generate more
   efficient query plans.


This doesn't seem to attack the actual underlying implementation for how
things like "mapWithState" are going to be translated into RQ, and I think
that's the hole that's causing my misunderstanding.

On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov  wrote:

> Hi Ofir,
> Thanks for the elaborated answer. I have read both documents, where they
> do a light touch on infinite Dataframes/Datasets. However, they do not go
> in depth as regards to how existing transformations on DStreams, for
> example, will be transformed into the Dataset APIs. I've been browsing the
> 2.0 branch and have not yet been able to understand how they correlate.
>
> Also, placing SparkSession in the sql package seems like a peculiar
> choice, since this is going to be the global abstraction over
> SparkContext/StreamingContext from now on.
>
> On Sun, May 15, 2016, 23:42 Ofir Manor  wrote:
>
>> Hi Yuval,
>> let me share my understanding based on similar questions I had.
>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
>> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
>> (merging of Dataset and Dataframe - which is why it inherits all the
>> SparkSQL goodness), while RDD seems as a low-level API only for special
>> cases. The new Dataset should also support both batch and streaming -
>> replacing (eventually) DStream as well. See the design docs in SPARK-13485
>> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
>> However, as you noted, not all will be fully delivered in 2.0. For
>> example, it seems that streaming from / to Kafka using StructuredStreaming
>> didn't make it (so far?) to 2.0 (which is a showstopper for me).
>> Anyway, as far as I understand, you should be able to apply stateful
>> operators (non-RDD) on Datasets (for example, the new event-time window
>> processing SPARK-8360). The gap I see is mostly limited streaming sources /
>> sinks migrated to the new (richer) API and semantics.
>> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
>> examples will align with the current offering...
>>
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov 
>> wrote:
>>
>>> I've been reading/watching videos about the upcoming Spark 2.0 release
>>> which
>>> brings us Structured Streaming. One thing I've yet to understand is how
>>> this
>>> relates to the current state of working with Streaming in Spark with the
>>> DStream abstraction.
>>>
>>> All examples I can find, in the Spark repository/different videos is
>>> someone
>>> streaming local JSON files or reading from HDFS/S3/SQL. Also, when
>>> browsing
>>> the source, SparkSession seems to be defined inside
>>> org.apache.spark.sql, so
>>> this gives me a hunch that this is somehow all related to SQL and the
>>> likes,
>>> and not really to DStreams.
>>>
>>> What I'm failing to understand is: Will this feature impact how we do
>>> Streaming today? Will I be able to consume a Kafka source in a streaming
>>> fashion (like we do today when we open a stream using KafkaUtils)? Will
>>> we
>>> be able to do state-full operations on a Dataset[T] like we do today
>>> using
>>> MapWithStateRDD? Or will there be a subset of operations that the
>>> catalyst
>>> optimizer can understand such as aggregate and such?
>>>
>>> I'd be happy anyone could shed some light on this.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: 

Re: Kafka stream message sampling

2016-05-16 Thread Mich Talebzadeh
Hi Samuel,

How do you create your RDD based on the Kafka direct stream?

Do you have your code snippet?
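
For reference, a filter-based sample on a direct stream would look something
like the sketch below (assuming the Kafka 0.8 direct API with String keys and
values; the broker, topic and batch interval are made up). Note that the filter
runs on the executors after each batch has already been fetched from Kafka,
which is why the "input" count in the Streaming UI does not shrink:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Assumes an existing SparkContext `sc`.
    val ssc = new StreamingContext(sc, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical broker
    val topics = Set("events")                                       // hypothetical topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Keep roughly 1 message in 10; every record is still read from Kafka first.
    val sampled = stream.filter { case (_, value) => math.abs(value.hashCode) % 10 == 0 }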

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 May 2016 at 23:24, Samuel Zhou  wrote:

> Hi,
>
> I was trying to use filter to sample a Kafka direct stream, and the
> filter function just takes 1 message out of 10 by using hashcode % 10 == 0,
> but the number of input events for each batch didn't shrink to 10% of the
> original traffic. So I want to ask if there is any way to shrink the batch
> size with a sampling function to save the traffic from Kafka?
>
> Thanks!
> Samuel
>


Re: Issue with Spark Streaming UI

2016-05-16 Thread Mich Talebzadeh
Have you checked the Streaming tab in the Spark GUI?

[image: Inline images 1]

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 May 2016 at 07:26, Sachin Janani  wrote:

> Hi,
> I'm trying to run a simple spark streaming application with File Streaming
> and it's working properly, but when I try to monitor the number of events in
> the Streaming UI it shows that as 0. Is this an issue, and are there any plans
> to fix this?
>
>
> Regards,
> SJ
>


Re: Debug spark core and streaming programs in scala

2016-05-16 Thread Mich Talebzadeh
Have you tried the Spark GUI on 4040? This will show jobs being executed by
executors in each stage, and the line of code as well.

[image: Inline images 1]

Also command line tools like jps and jmonitor

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 16 May 2016 at 06:25, Deepak Sharma  wrote:

> Hi
> I have a scala program consisting of spark core and spark streaming APIs.
> Is there any open source tool that I can use to debug the program for
> performance reasons?
> My primary interest is to find the blocks of code that would be executed on
> the driver and what would go to the executors.
> Is there a JMX extension of Spark?
>
> --
> Thanks
> Deepak
>
>


Re: Executors and Cores

2016-05-16 Thread Mich Talebzadeh
Hi Pradeep,

Resources allocated to each Spark app can be capped to allow balanced
resourcing across all apps. However, you really need to monitor each app.

One option would be to use the jmonitor package to look at resource usage
(heap, CPU, memory, etc.) for each job.

In general you should not allocate too much for each job, and FIFO is the
default scheduling.

If you are allocating resources, then you need to cap them, for example:

${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--driver-memory 4g \
--num-executors=1 \
--executor-memory=4G \
--executor-cores=2 \
…..

Don't over-allocate resources, as they will be wasted.


The Spark GUI on 4040 can be useful, but it only displays the FIFO job picked up,
so you won't see other jobs until the JVM that is using port 4040 has completed
or been killed.


Start by identifying Spark jobs through jps. They will show up as SparkSubmit.


HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 May 2016 at 13:19, Mail.com  wrote:

> Hi ,
>
> I have seen multiple videos on spark tuning which show how to determine the
> number of cores, executors, and the memory size of the job.
>
> In all that I have seen, it seems each job has to be given the max
> resources allowed in the cluster.
>
> How do we factor in input size as well? If I am processing a 1 GB compressed
> file, then I can live with, say, 10 executors and not 21, etc.
>
> Also, do we consider other jobs in the cluster that could be running? I
> will use only 20 GB out of the available 300 GB, etc.
>
> Thanks,
> Pradeep
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>