Re:Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread 李明伟
[root@ES01 test]# jps 10409 Master 12578 CoarseGrainedExecutorBackend 24089 NameNode 17705 Jps 24184 DataNode 10603 Worker 12420 SparkSubmit [root@ES01 test]# ps -awx | grep -i spark | grep java 10409 ?Sl 1:52 java -cp

Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh
what does jps returning? jps 16738 ResourceManager 14786 Worker 17059 JobHistoryServer 12421 QuorumPeerMain 9061 RunJar 9286 RunJar 5190 SparkSubmit 16806 NodeManager 16264 DataNode 16138 NameNode 16430 SecondaryNameNode 22036 SparkSubmit 9557 Jps 13240 Kafka 2522 Master and ps -awx | grep -i

Re:Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread 李明伟
Hi Ted, Spark version: spark-1.6.0-bin-hadoop2.6. I tried increasing the memory of the executor; still have the same problem. I can use jmap to capture something, but the output is too difficult to understand. At 2016-05-11 11:50:14, "Ted Yu" wrote: Which Spark release

How to resolve Scheduling delay in Spark streaming applications?

2016-05-10 Thread Hemalatha A
Hello, We are facing a large scheduling delay in our Spark Streaming application and are not sure how to debug why the delay is happening. We have done all the tuning possible on the Spark side. Can someone advise how to debug the cause of the delay and share some tips for resolving it, please? -- Regards

Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread Ted Yu
Which Spark release are you using ? I assume executor crashed due to OOME. Did you have a chance to capture jmap on the executor before it crashed ? Have you tried giving more memory to the executor ? Thanks On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com wrote: > I

Will the HiveContext cause memory leak ?

2016-05-10 Thread kramer2...@126.com
I submit my code to a Spark standalone cluster and find that the memory usage of the executor process keeps growing, which causes the program to crash. I modified the code and submitted it several times, and find the below 4 lines may be causing the issue: dataframe =

What does the spark stand alone cluster do?

2016-05-10 Thread kramer2...@126.com
Hello. My question here is what the Spark standalone cluster does here, because when we submit a program like below: ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2" We specified the

Re:Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread 李明伟
Hi Mich, From the ps command I can find four processes: 10409 is the master and 10603 is the worker; 12420 is the driver program and 12578 should be the executor (worker). Am I right? So you mean 12420 is actually running both the driver and the worker role? [root@ES01 ~]# ps -awx | grep

Unable to write stream record to cassandra table with multiple columns

2016-05-10 Thread Anand N Ilkal
I am trying to write incoming stream data to a database. Following is the example program: this code creates a thread to listen to an incoming stream of CSV data. This data needs to be split on a delimiter, and the array of values needs to be pushed to the database as separate columns in the
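
Not the poster's code, but a minimal sketch of the usual connector approach, assuming the spark-cassandra-connector is on the classpath; the keyspace, table, and column names below are hypothetical:

    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._
    import org.apache.spark.streaming.dstream.DStream

    // Split each CSV line on the delimiter and save the fields as separate columns.
    // "my_keyspace", "my_table" and the column names are placeholders.
    def writeToCassandra(lines: DStream[String]): Unit = {
      lines.map(_.split(","))
        .filter(_.length >= 3)                 // drop malformed records
        .map(f => (f(0), f(1), f(2)))
        .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "name", "value"))
    }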

RE: Accessing Cassandra data from Spark Shell

2016-05-10 Thread Mohammed Guller
Yes, it is very simple to access Cassandra data using Spark shell. Step 1: Launch the spark-shell with the spark-cassandra-connector package $SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0 Step 2: Create a DataFrame pointing to your Cassandra table
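
For reference, a minimal sketch of Step 2, run inside the shell launched as above (keyspace and table names are placeholders):

    // Create a DataFrame backed by a Cassandra table and pull a few rows back.
    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "your_table", "keyspace" -> "your_keyspace"))
      .load()
    df.show(10)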

RE: Reading table schema from Cassandra

2016-05-10 Thread Mohammed Guller
You can create a DataFrame directly from a Cassandra table using something like this: val dfCassTable = sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "your_column_family", "keyspace" -> "your_keyspace")).load() Then, you can get schema: val dfCassTableSchema

Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh
Hmm, this is standalone mode. When you are running Spark in standalone mode, you only have one worker that lives within the driver JVM process that you start when you start spark-shell or spark-submit. However, since the driver-memory setting encapsulates the JVM, you will need to set the amount

Spark 1.6 Catalyst optimizer

2016-05-10 Thread Telmo Rodrigues
Hello, I have a question about the Catalyst optimizer in Spark 1.6. initial logical plan: !'Project [unresolvedalias(*)] !+- 'Filter ('t.id = 1) ! +- 'Join Inner, Some(('t.id = 'u.id)) ! :- 'UnresolvedRelation `t`, None ! +- 'UnresolvedRelation `u`, None logical plan after

Re:Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread 李明伟
I actually provided them in the submit command here: nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2" ./mycode.py 1>a.log 2>b.log & At 2016-05-10 21:19:06, "Mich Talebzadeh"

Re: Cluster Migration

2016-05-10 Thread Ajay Chander
Never mind! I figured it out by saving it as hadoopfile and passing the codec to it. Thank you! On Tuesday, May 10, 2016, Ajay Chander wrote: > Hi, I have a folder temp1 in hdfs which have multiple format files > test1.txt, test2.avsc (Avro file) in it. Now I want to
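
Roughly what that looks like, as a sketch under assumed HDFS paths and a GzipCodec (the poster's actual codec and paths aren't shown):

    import org.apache.hadoop.io.compress.GzipCodec

    // wholeTextFiles yields (filename, content) pairs; write them back out compressed.
    val files = sc.wholeTextFiles("hdfs:///user/xyz/temp1")          // placeholder input path
    files.map { case (name, content) => s"$name\t$content" }
      .saveAsTextFile("hdfs:///user/xyz/temp2", classOf[GzipCodec])  // placeholder output path

Note this writes one compressed part file per partition rather than a single test_compress.gz; coalescing to one partition first would be needed for a single output file.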

Re: SparkSQL with large result size

2016-05-10 Thread Buntu Dev
Thanks Chris for pointing out the issue. I think I was able to get over this issue by: - repartitioning to increase the number of partitions (about 6k partitions) - apply sort() on the resulting dataframe to coalesce into single sorted partition file - read the sorted file and then adding just

Re: Cluster Migration

2016-05-10 Thread Ajay Chander
Hi, I have a folder temp1 in HDFS which has files of multiple formats, test1.txt and test2.avsc (Avro file), in it. Now I want to compress these files together and store them under a temp2 folder in HDFS, expecting that the temp2 folder will have one file test_compress.gz which has test1.txt and test2.avsc under

Reliability of JMS Custom Receiver in Spark Streaming JMS

2016-05-10 Thread Sourav Mazumder
Hi, I need to get a bit more understanding of the reliability aspects of Custom Receivers in the context of the code in spark-streaming-jms, https://github.com/mattf/spark-streaming-jms. Based on the documentation in

Re: Save DataFrame to HBase

2016-05-10 Thread Ted Yu
I think so. Please refer to the table population tests in (master branch): hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/DefaultSourceSuite.scala Cheers On Tue, May 10, 2016 at 2:53 PM, Benjamin Kim wrote: > Ted, > > Will the hbase-spark module allow for

Re: Save DataFrame to HBase

2016-05-10 Thread Benjamin Kim
Ted, Will the hbase-spark module allow for creating tables in Spark SQL that reference the hbase tables underneath? In this way, users can query using just SQL. Thanks, Ben > On Apr 28, 2016, at 3:09 AM, Ted Yu wrote: > > Hbase 2.0 release likely would come after Spark

Re: Pyspark accumulator

2016-05-10 Thread Abi
On May 10, 2016 2:24:41 PM EDT, Abi wrote: >1. How come pyspark does not provide the localvalue function like scala >? > >2. Why is pyspark more restrictive than scala ?

Re: Accumulator question

2016-05-10 Thread Abi
On May 9, 2016 8:24:06 PM EDT, Abi wrote: >I am splitting an integer array in 2 partitions and using an >accumulator to sum the array. problem is > >1. I am not seeing execution time becoming half of a linear summing. > >2. The second node (from looking at

Re: pyspark mappartions ()

2016-05-10 Thread Abi
On May 10, 2016 2:20:25 PM EDT, Abi wrote: >Is there any example of this? I want to see how you write the >iterable example

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hi Xinh, Thanks! Custom partitioner with partitionBy() did the job. On Tue, May 10, 2016 at 11:36 PM, Xinh Huynh wrote: > Hi Ayman, > > Have you looked at this: >

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
Well, for Python, it should be rdd.coalesce(10, shuffle=True) I have had good success with this using the Scala API in Spark 1.6.1. -Don On Tue, May 10, 2016 at 3:15 PM, Ayman Khalil wrote: > And btw, I'm using the Python API if this makes any difference. > > On Tue,

Not able pass 3rd party jars to mesos executors

2016-05-10 Thread gpatcham
Hi All, I'm using the --jars option in spark-submit to ship 3rd-party jars, but I don't see them actually being passed to the Mesos slaves, and I'm getting "no class found" exceptions. This is how I'm using the --jars option: --jars hdfs://namenode:8082/user/path/to/jar Am I missing something here, or what's the

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Xinh Huynh
Hi Ayman, Have you looked at this: http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where It recommends defining a custom partitioner and (PairRDD) partitionBy method to accomplish this. Xinh On Tue, May 10, 2016 at 1:15 PM,
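
A sketch of what the linked answer describes, an exact-size partitioner over items keyed by a sequential index (class and variable names are illustrative):

    import org.apache.spark.Partitioner

    // Element with index i (of n items) goes to partition i * p / n, giving equal-sized partitions.
    class ExactPartitioner(partitions: Int, elements: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        (key.asInstanceOf[Long] * partitions / elements).toInt
    }

    val rdd      = sc.parallelize(1 to 50000)                 // 50,000 items, as in the question
    val keyed    = rdd.zipWithIndex().map(_.swap)             // key each item by its index
    val balanced = keyed.partitionBy(new ExactPartitioner(10, 50000)).values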

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
And btw, I'm using the Python API if this makes any difference. On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil wrote: > Hi Don, > > This didn't help. My original rdd is already created using 10 partitions. > As a matter of fact, after trying with rdd.coalesce(10, shuffle

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hi Don, This didn't help. My original rdd is already created using 10 partitions. As a matter of fact, after trying with rdd.coalesce(10, shuffle = true) out of curiosity, the rdd partitions became even more imbalanced: [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,

Spark crashes with Filesystem recovery

2016-05-10 Thread Imran Akbar
I have some Python code that consistently ends up in this state: ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server Traceback (most recent call last): File "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
You can call rdd.coalesce(10, shuffle = true) and the returning rdd will be evenly balanced. This obviously triggers a shuffle, so be advised it could be an expensive operation depending on your RDD size. -Don On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil wrote: >
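
A minimal sketch of the suggestion (the RDD below is just a stand-in); note that coalesce with shuffle gives roughly even, not exact, partition sizes:

    val rdd      = sc.parallelize(1 to 50000, 10)
    val balanced = rdd.coalesce(10, shuffle = true)   // shuffle = true forces a full rebalance

    // Inspect the per-partition counts to see how even they are.
    balanced.mapPartitions(it => Iterator(it.size)).collect().foreach(println)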

Hi test

2016-05-10 Thread Abi
Hello test

Pyspark accumulator

2016-05-10 Thread Abi
1. How come pyspark does not provide the localvalue function like scala ? 2. Why is pyspark more restrictive than scala ?

Re: Cluster Migration

2016-05-10 Thread Ajay Chander
Hi Deepak, Thanks for your response. If I am correct, you suggest reading all of those files into an RDD on the cluster using wholeTextFiles, then applying a compression codec to it and saving the RDD to another Hadoop cluster? Thank you, Ajay On Tuesday, May 10, 2016, Deepak Sharma

Re: Cluster Migration

2016-05-10 Thread Ajay Chander
I will try that out. Thank you! On Tuesday, May 10, 2016, Deepak Sharma wrote: > Yes that's what I intended to say. > > Thanks > Deepak > On 10 May 2016 11:47 pm, "Ajay Chander" > wrote: > >> Hi

pyspark mappartions ()

2016-05-10 Thread Abi
Is there any example of this? I want to see how you write the iterable example

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
Yes that's what I intended to say. Thanks Deepak On 10 May 2016 11:47 pm, "Ajay Chander" wrote: > Hi Deepak, >Thanks for your response. If I am correct, you suggest reading all > of those files into an rdd on the cluster using wholeTextFiles then apply > compression

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
Hi Ajay, You can look at the wholeTextFiles method, which gives an RDD[(String, String)], and then map the RDD and saveAsTextFile it. This will serve the purpose. I don't think anything like distcp exists by default in Spark. Thanks Deepak On 10 May 2016 11:27 pm, "Ajay Chander" wrote: > Hi

Cluster Migration

2016-05-10 Thread Ajay Chander
Hi Everyone, we are planning to migrate the data between 2 clusters and I see distcp doesn't support data compression. Is there any efficient way to compress the data during the migration ? Can I implement any spark job to do this ? Thanks.

Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hello, I have 50,000 items parallelized into an RDD with 10 partitions, I would like to evenly split the items over the partitions so: 50,000/10 = 5,000 in each RDD partition. What I get instead is the following (partition index, partition count): [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4,

Re: partitioner aware subtract

2016-05-10 Thread Raghava Mutharaju
Thank you for the response. This does not work on the test case that I mentioned in the previous email. val data1 = Seq((1 -> 2), (1 -> 5), (2 -> 3), (3 -> 20), (3 -> 16)) val data2 = Seq((1 -> 2), (3 -> 30), (3 -> 16), (5 -> 12)) val rdd1 = sc.parallelize(data1, 8) val rdd2 =

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Cody Koeninger
Pretty much the same problems you'd expect any time you have skew in a distributed system - some leaders are going to be working harder than others & have more disk space used, some consumers are going to be working harder than others. It sounds like you're talking about differences in topics,

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread chandan prakash
Hey Cody, What kind of problems exactly? ...the data rate in Kafka topics does vary significantly in my case... out of a total of 50 topics (with 3 partitions each), half of the topics generate data at a very high rate, say 1 lakh (100k)/sec, while the other half generate at a very low rate, say 1k/sec... I have

Re: partitioner aware subtract

2016-05-10 Thread Rishi Mishra
As you have the same partitioner and number of partitions, you can probably use zipPartitions and provide a user-defined function to subtract. A very primitive example being: val data1 = Seq(1->1,2->2,3->3,4->4,5->5,6->6,7->7) val data2 = Seq(1->1,2->2,3->3,4->4,5->5,6->6) val rdd1 =
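
A fuller sketch along those lines; it assumes both RDDs have first been co-partitioned with the same partitioner (e.g. partitionBy with a HashPartitioner), otherwise matching keys can land in different partition indexes:

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(8)
    val rdd1 = sc.parallelize(Seq(1 -> 2, 1 -> 5, 2 -> 3, 3 -> 20, 3 -> 16)).partitionBy(part)
    val rdd2 = sc.parallelize(Seq(1 -> 2, 3 -> 30, 3 -> 16, 5 -> 12)).partitionBy(part)

    // Per-partition set difference; no extra shuffle since the partitions line up.
    val subtracted = rdd1.zipPartitions(rdd2, preservesPartitioning = true) { (it1, it2) =>
      val toRemove = it2.toSet
      it1.filter(!toRemove(_))
    }
    // subtracted.collect() should give the rdd1 pairs not in rdd2: (1,5), (2,3), (3,20)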

Re: Spark-csv- partitionBy

2016-05-10 Thread Xinh Huynh
Hi Pradeep, Here is a way to partition your data into different files, by calling repartition() on the dataframe: df.repartition(12, $"Month") .write .format(...) This is assuming you want to partition by a "month" column where there are 12 different values. Each partition will be stored in
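
Filling in the elided call with a hedged sketch: the spark-csv format string and the paths are assumptions, and "Month" is the hypothetical column:

    import sqlContext.implicits._

    // Placeholder input; assumes the spark-csv package is on the classpath.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///tmp/input")

    // Hash-partition by the month column so rows for the same month end up in the same part file.
    df.repartition(12, $"Month")
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///tmp/output_by_month")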

Re: Updating Values Inside Foreach Rdd loop

2016-05-10 Thread Rishi Mishra
Hi Harsh, You probably need to maintain some state for your values, as you are updating some of the keys in a batch and checking a global state of your equation. Can you check the mapWithState API of DStream? Regards, Rishitesh Mishra, SnappyData. (http://www.snappydata.io/)
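
A small sketch of what mapWithState looks like in Spark 1.6, with a hypothetical running-sum state (key and value types are illustrative):

    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    // Keep a running sum per key across batches; emit (key, updated sum) each time a key is seen.
    val spec = StateSpec.function((key: String, value: Option[Double], state: State[Double]) => {
      val updated = state.getOption.getOrElse(0.0) + value.getOrElse(0.0)
      state.update(updated)
      (key, updated)
    })

    def withRunningSums(pairs: DStream[(String, Double)]): DStream[(String, Double)] =
      pairs.mapWithState(spec)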

Re: Accumulator question

2016-05-10 Thread Rishi Mishra
Your mail does not describe much, but won't a simple reduce function help you? Something like the below: val data = Seq(1,2,3,4,5,6,7) val rdd = sc.parallelize(data, 2) val sum = rdd.reduce((a,b) => a+b) Regards, Rishitesh Mishra, SnappyData. (http://www.snappydata.io/)
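
For completeness, a sketch of the accumulator variant the original question describes (Spark 1.x accumulator API):

    // Each task adds its partition's elements into the accumulator;
    // the summed value is only readable back on the driver.
    val acc = sc.accumulator(0)
    sc.parallelize(Seq(1, 2, 3, 4, 5, 6, 7), 2).foreach(x => acc += x)
    println(acc.value)   // 28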

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Soumitra Johri
I think a better partitioning scheme can help you too. On Tue, May 10, 2016 at 10:31 AM Cody Koeninger wrote: > maxRate is not used by the direct stream. > > Significant skew in rate across different partitions for the same > topic is going to cause you all kinds of problems,

Re: Init/Setup worker

2016-05-10 Thread Natu Lauchande
Hi, Not sure if this might be helpful to you : https://github.com/ondra-m/ruby-spark . Regards, Natu On Tue, May 10, 2016 at 4:37 PM, Lionel PERRIN wrote: > Hello, > > > > I’m looking for a solution to use jruby on top of spark. The only tricky > point is that I

Init/Setup worker

2016-05-10 Thread Lionel PERRIN
Hello, I'm looking for a solution to use JRuby on top of Spark. The only tricky point is that I need every worker thread to have a Ruby interpreter initialized. Basically, I need to register a function to be called when each worker thread is created: a thread-local variable must be set for

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Cody Koeninger
maxRate is not used by the direct stream. Significant skew in rate across different partitions for the same topic is going to cause you all kinds of problems, not just with spark streaming. You can turn on backpressure, but you're better off addressing the underlying issue if you can. On Tue,

Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh
Hi Mingwei, In your Spark conf settings, what are you providing for these parameters? *Are you capping them?* For example: val conf = new SparkConf(). setAppName("AppName"). setMaster("local[2]"). set("spark.executor.memory", "4G").

Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Soumitra Siddharth Johri
Also look at enabling backpressure. Both of these can be used to limit the rate. Sent from my iPhone > On May 10, 2016, at 8:02 AM, chandan prakash > wrote: > > Hi, > I am using Spark Streaming with Direct kafka approach. > Want to limit number of event records
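
A hedged sketch of the relevant settings for the direct approach (the values are illustrative, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("DirectKafkaApp")
      // cap for the direct stream: max records per second, per Kafka partition
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // let Spark adapt the ingestion rate to batch processing times
      .set("spark.streaming.backpressure.enabled", "true")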

Re: best fit - Dataframe and spark sql use cases

2016-05-10 Thread Mathieu Longtin
Spark SQL is translated to DataFrame operations by the SQL engine. Use whichever is more comfortable for the task. Unless I'm doing something very straightforward, I go with SQL, since any improvement to the SQL engine will improve the resulting DataFrame operations. Hard-coded DataFrame

Re: spark 2.0 issue with yarn?

2016-05-10 Thread Steve Loughran
On 9 May 2016, at 21:24, Jesse F Chen > wrote: I had been running fine until builds around 05/07/2016. If I used "--master yarn" in builds after 05/07, I got the following error... sounds like some jars are missing. I am using YARN

Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread chandan prakash
Hi, I am using Spark Streaming with the direct Kafka approach and want to limit the number of event records coming into my batches. I have a question regarding the following 2 parameters: 1. spark.streaming.receiver.maxRate 2. spark.streaming.kafka.maxRatePerPartition The documentation (

Reading table schema from Cassandra

2016-05-10 Thread justneeraj
Hi, We are using the Spark Cassandra connector for our app, and I am trying to create higher-level roll-up tables, e.g. a minutes table from a seconds table. If my tables are already defined, how can I read the schema of a table, so that I can load them into a DataFrame and create the aggregates? Any

Re: Spark-csv- partitionBy

2016-05-10 Thread Mail.com
Hi, I don't want to reduce partitions. It should write files depending upon the column value. I'm trying to understand how reducing the partition size will make it work. Regards, Pradeep > On May 9, 2016, at 6:42 PM, Gourav Sengupta wrote: > > Hi, > > its supported, try to

Re:Re: Re: spark uploading resource error

2016-05-10 Thread 朱旻
Thanks! I solved the problem. spark-submit changed HADOOP_CONF_DIR to spark/conf, which was correct, but using java *... didn't change HADOOP_CONF_DIR; it still used hadoop/etc/hadoop. At 2016-05-10 16:39:47, "Saisai Shao" wrote: The code is in Client.scala

Re: Re: spark uploading resource error

2016-05-10 Thread Saisai Shao
The code is in Client.scala under yarn sub-module (see the below link). Maybe you need to check the vendor version about their changes to the Apache Spark code. https://github.com/apache/spark/blob/branch-1.3/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala Thanks Saisai On Tue,

Re: SparkSQL with large result size

2016-05-10 Thread Christophe Préaud
Hi, You may be hitting this bug: SPARK-9879 In other words: did you try without the LIMIT clause? Regards, Christophe. On 02/05/16 20:02, Gourav Sengupta wrote: Hi, I have worked on 300GB data by querying it from CSV (using SPARK CSV) and

Re:Re: spark uploading resource error

2016-05-10 Thread 朱旻
It is a product sold by Huawei, named FusionInsight. It says Spark is 1.3 with Hadoop 2.7.1. Where can I find the code or config file which defines the files to be uploaded? At 2016-05-10 16:06:05, "Saisai Shao" wrote: What version of Spark are you

Re: spark uploading resource error

2016-05-10 Thread Saisai Shao
What version of Spark are you using? From my understanding, there's no code in yarn#client that will upload "__hadoop_conf__" into the distributed cache. On Tue, May 10, 2016 at 3:51 PM, 朱旻 wrote: > Hi all: > I found a problem using Spark. > When I use spark-submit to

spark uploading resource error

2016-05-10 Thread 朱旻
Hi all: I found a problem using Spark. When I use spark-submit to launch a task, it works: spark-submit --num-executors 8 --executor-memory 8G --class com.icbc.nss.spark.PfsjnlSplit --master yarn-cluster /home/nssbatch/nss_schedual/jar/SparkBigtableJoinSqlJava.jar

Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-10 Thread Johnny W.
Thanks, Ashish. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-15247 Best, J. On Sun, May 8, 2016 at 7:07 PM, Ashish Dubey wrote: > I see the behavior - so it always goes with min total tasks possible on > your settings ( num-executors * num-cores ) -

Pyspark with non default hive table

2016-05-10 Thread ayan guha
Hi, Can we write to a non-default Hive table using pyspark?