What would happen when reduce memory is not enough on spark shuffle read stage?

2015-11-09 Thread JoneZhang
For example, if the data size of the shuffle read is 1 TB and the number of reducers is 200, each reducer needs to fetch 1 TB / 200 = 5 GB of data. If the total memory of one reducer is only 4 GB, what would happen?
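The arithmetic in the question, as a quick sanity check (pure Python; the 1 TB / 200-reducer figures come from the post above). In general, when a reduce task's data exceeds its memory, Spark can spill sorted/aggregated data to disk rather than failing outright, and raising the partition count shrinks each task's share:

```python
# Per-reducer shuffle volume: a back-of-the-envelope helper.
def per_reducer_gb(total_shuffle_gb, num_reducers):
    """Average data each reduce task must fetch, in GB."""
    return total_shuffle_gb / num_reducers

# The scenario from the question: 1 TB (1024 GB) across 200 reducers.
assert per_reducer_gb(1024, 200) == 5.12   # ~5 GB per reducer, as stated
# Raising the partition count shrinks each task's share:
assert per_reducer_gb(1024, 1024) == 1.0
```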

Re: [SPARK STREAMING ] Sending data to ElasticSearch

2015-11-09 Thread Akhil Das
Have a look at https://github.com/elastic/elasticsearch-hadoop#apache-spark You can simply call the .saveToEs function to store your RDD data into ES. Thanks Best Regards On Thu, Oct 29, 2015 at 8:19 PM, Nipun Arora wrote: > Hi, > > I am sending data to an

Re: heap memory

2015-11-09 Thread Akhil Das
It's coming from Parquet; you can try increasing your driver memory and see if it still occurs. Thanks Best Regards On Fri, Oct 30, 2015 at 7:16 PM, Younes Naguib <

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-09 Thread Tathagata Das
You cannot submit from eclipse to a cluster that easily. You can run locally (master set to local...), and it should work with just the pom. On Mon, Nov 9, 2015 at 2:49 AM, أنس الليثي wrote: > If I packaged the application and submit it, it works fine but I need to > run

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Yes, that’s the problem. http://search.maven.org/#artifactdetails%7Ccom.twitter%7Cparquet-avro%7C1.6.0%7Cjar This depends on parquet-hadoop-1.6.0, which triggered this bug. Can you change the version to 1.6.0rc7?

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Is that all you have in the executor logs? I suspect some of those jobs are having a hard time managing the memory. Thanks Best Regards On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman wrote: > [adding dev list since it's probably a bug, but i'm not sure how to > reproduce so I

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread swetha kasireddy
I am using the following: com.twitter parquet-avro 1.6.0 On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu wrote: > Which Spark version used? > > It was fixed in Parquet-1.7x, so Spark-1.5.x will be work. > > > > > > On Nov 9, 2015, at 3:43 PM, swetha

Re: Issue on spark.driver.maxResultSize

2015-11-09 Thread Akhil Das
You can set it in your conf/spark-defaults.conf file, or you will have to set it before you create the SparkContext. Thanks Best Regards On Fri, Oct 30, 2015 at 4:31 AM, karthik kadiyam < karthik.kadiyam...@gmail.com> wrote: > Hi, > > In spark streaming job i had the following setting > >
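The file-based option mentioned in the reply, as a conf/spark-defaults.conf entry (the 2g value is only illustrative; pick a limit that fits your driver heap):

```properties
# conf/spark-defaults.conf — cap on the total size of serialized results
# returned to the driver (value below is illustrative)
spark.driver.maxResultSize  2g
```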

Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-09 Thread Akhil Das
Did you go through http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup for YARN? I guess you will have to copy the spark-1.5.1-yarn-shuffle.jar to the classpath of all NodeManagers in your cluster. Thanks Best Regards On Fri, Oct 30, 2015 at 7:41 PM, Tom Stewart <

Re: How to properly read the first number lines of file into a RDD

2015-11-09 Thread Akhil Das
There are multiple ways to achieve this: 1. Read the N lines from the driver and then do a sc.parallelize(nlines) to create an RDD out of it. 2. Create an RDD with N+M, do a take on N and then broadcast or parallelize the returned list. 3. Something like this if the file is in hdfs: val n_f =
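The first option in the reply — read the N lines on the driver, then parallelize — can be sketched like this (plain Python; the sc.parallelize call is left as a comment since it needs a live SparkContext):

```python
from itertools import islice

def head_lines(lines, n):
    """Take only the first n lines from any line iterator (e.g. an open
    file handle) without reading the rest of the file."""
    return [line.rstrip("\n") for line in islice(lines, n)]

# With an open file and a live SparkContext (not run here), one would do:
# with open("data.txt") as f:
#     rdd = sc.parallelize(head_lines(f, 100))
```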

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-09 Thread DW @ Gmail
Try to reproduce what the spark-submit shell script does, setting up the class path etc. Sent from my rotary phone. > On Nov 9, 2015, at 7:07 AM, Tathagata Das wrote: > > You cannot submit from eclipse to a cluster that easily. You can run locally > (master set to

Re: Spark IndexedRDD dependency in Maven

2015-11-09 Thread Ted Yu
I would suggest asking this question on SPARK-2365 since IndexedRDD has not been released (upstream) Cheers On Mon, Nov 9, 2015 at 1:34 PM, swetha wrote: > > Hi , > > What is the appropriate dependency to include for Spark Indexed RDD? I get > compilation error if I

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Marcelo Vanzin
We've had this in the past when using "@VisibleForTesting" in classes that for some reason the shell tries to process. QueryExecution.scala seems to use that annotation and that was added recently, so that's probably the issue. BTW, if anyone knows how Scala can find a reference to the original

Is it possible Running SparkR on 2 nodes without HDFS

2015-11-09 Thread Sanjay Subramanian
hey guys I have a 2 node SparkR (1 master 1 slave) cluster on AWS using spark-1.5.1-bin-without-hadoop.tgz. Running the SparkR job on the master node: /opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0 --executor-cores

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Marcelo Vanzin
On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu wrote: > If there is no option to let shell skip processing @VisibleForTesting , > should the annotation be dropped ? That's what we did last time this showed up. > On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Michael Armbrust
Yeah, we should probably remove that. On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu wrote: > If there is no option to let shell skip processing @VisibleForTesting , > should the annotation be dropped ? > > Cheers > > On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin

Re: kryos serializer

2015-11-09 Thread Hitoshi Ozawa
This is a slightly old thread, but in case other users still want to know the answer, check the following page. The property is set in conf/spark-env.sh http://arjon.es/2014/04/14/how-to-change-default-serializer-on-apache-spark-shell/
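The property itself (spark.serializer) is commonly set in conf/spark-defaults.conf or passed with --conf on the command line; an illustrative entry:

```properties
# conf/spark-defaults.conf — switch the default serializer to Kryo
spark.serializer  org.apache.spark.serializer.KryoSerializer
```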

Re: Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread fightf...@163.com
Hi, Have you ever considered Cassandra as a replacement? We now have almost the same usage as your engine, e.g. using MySQL to store initial aggregated data. Can you share more about your kind of Cube queries? We are very interested in that arch too : ) Best, Sun. fightf...@163.com

could not see the print out log in spark functions as mapPartitions

2015-11-09 Thread Zhiliang Zhu
Hi All, I need to debug a Spark job; my general way is to print out logs. However, some bugs are in Spark functions such as mapPartitions etc., and no log printed from those functions could be found... Would you help point out the way to the logs in Spark's own functions such as mapPartitions? Or, what is

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Xiu Guo
Hi Zhan: I hit the exact problem you hit. I rolled back to commit: de289bf279e14e47859b5fbcd70e97b9d0759f14 which does not have this problem. I suspect something delivered in the past 4 days caused this problem. On Mon, Nov 9, 2015 at 12:20 PM Ted Yu wrote: > I

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Josh Rosen
When we remove this, we should add a style-checker rule to ban the import so that it doesn't get added back by accident. On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust wrote: > Yeah, we should probably remove that. > > On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Ted Yu
Created https://github.com/apache/spark/pull/9585 Cheers On Mon, Nov 9, 2015 at 6:39 PM, Josh Rosen wrote: > When we remove this, we should add a style-checker rule to ban the import > so that it doesn't get added back by accident. > > On Mon, Nov 9, 2015 at 6:13 PM,

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Ted Yu
If there is no option to let shell skip processing @VisibleForTesting , should the annotation be dropped ? Cheers On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin wrote: > We've had this in the past when using "@VisibleForTesting" in classes > that for some reason the shell

Re: Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread fightf...@163.com
Hi, According to my experience, I would recommend option 3), using Apache Kylin, for your requirements. This is a suggestion based on the open-source world. As for the Cassandra thing, I accept your advice on the special support thing. But the community is very open and convenient for

Re: could not see the print out log in spark functions as mapPartitions

2015-11-09 Thread Deng Ching-Mallete
Hi Zhiliang, You should be able to see them in the executor logs, which you can view via the Spark UI, in the Executors page (stderr log). HTH, Deng On Tue, Nov 10, 2015 at 11:33 AM, Zhiliang Zhu wrote: > Hi All, > > I need debug spark job, my general way is to

Re: could not see the print out log in spark functions as mapPartitions

2015-11-09 Thread Zhiliang Zhu
Dear Ching-Mallete, There are machines master01, master02 and master03 in the cluster; I could see the stderr log and stdout log in the directories /.../yarn/logs on those machines. In the stderr there are some notes; I just use logs to follow the track of the bug, however the log in

Re: could not see the print out log in spark functions as mapPartitions

2015-11-09 Thread Zhiliang Zhu
Also for the Spark UI: that is, logs from other places could be found, but the logs from functions such as mapPartitions could not. On Tuesday, November 10, 2015 11:52 AM, Zhiliang Zhu wrote: Dear Ching-Mallete , There are machines master01, master02 and master03

Re: streaming: missing data. does saveAsTextFile() append or replace?

2015-11-09 Thread Andy Davidson
Thanks Gerard, I'll give that a try. It seems like this approach is going to create a very large number of files. I guess I could write a cron job to concatenate the files by hour or maybe days. I imagine this is a common problem. Do you know of something that does this already? I am using the

Re: Spark Streaming updateStateByKey Implementation

2015-11-09 Thread Hien Luu
Thanks Zoltan. I will take a look at StateDStream.scala On Sun, Nov 8, 2015 at 2:42 AM, Zoltán Zvara wrote: > It is implemented with cogroup. Basically it stores states in a separate > RDD and cogroups the target RDD with the state RDD, which is then hidden > from you.
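Zoltán's description — state kept in a separate RDD that is cogrouped with each batch — can be mimicked in plain Python. A simplified sketch of the cogroup-and-update idea (the real StateDStream also handles partitioning, checkpointing, and RDD lineage):

```python
from collections import defaultdict

def update_state(state, batch, update_fn):
    """Toy version of updateStateByKey: group the batch's values per key
    (the 'cogroup' step), then call update_fn(new_values, old_state) for
    every key seen in either the old state or the new batch. Returning
    None from update_fn drops the key, mirroring the Spark API."""
    grouped = defaultdict(list)
    for k, v in batch:
        grouped[k].append(v)
    new_state = {}
    for k in set(state) | set(grouped):
        s = update_fn(grouped.get(k, []), state.get(k))
        if s is not None:
            new_state[k] = s
    return new_state

# Running-count example, as commonly done with updateStateByKey:
counts = update_state({}, [("a", 1), ("a", 1), ("b", 1)],
                      lambda new, old: (old or 0) + sum(new))
# counts == {"a": 2, "b": 1}
```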

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Zhan Zhang
Thanks Ted. I am using the latest master branch. I will try your build command and give it a try. Thanks. Zhan Zhang On Nov 9, 2015, at 10:46 AM, Ted Yu wrote: Which branch did you perform the build with ? I used the following command yesterday:

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Ted Yu
Which branch did you perform the build with ? I used the following command yesterday: mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests Spark shell was working. Building with latest master branch. On Mon, Nov 9, 2015 at 10:37 AM, Zhan Zhang

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread Cody Koeninger
The direct stream will fail the task if there is a problem with the kafka broker. Spark will retry failed tasks automatically, which should handle broker rebalances that happen in a timely fashion. spark.task.maxFailures controls the maximum number of retries before failing the job. Direct stream
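For reference, the retry knob is spark.task.maxFailures (note the property name), shown here as a conf/spark-defaults.conf entry with an illustrative value (the default is 4):

```properties
# conf/spark-defaults.conf — task failures tolerated before failing the job
# (default: 4; the value 8 below is only illustrative)
spark.task.maxFailures  8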

Anybody hit this issue in spark shell?

2015-11-09 Thread Zhan Zhang
Hi Folks, Does anybody meet the following issue? I use "mvn package -Phive -DskipTests" to build the package. Thanks. Zhan Zhang bin/spark-shell ... Spark context available as sc. error: error while loading QueryExecution, Missing dependency 'bad symbolic reference. A signature in

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Did you find anything regarding the OOM in the executor logs? Thanks Best Regards On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman wrote: > If they have a problem managing memory, wouldn't there should be a OOM? > Why does AppClient throw a NPE? > > *Romi Kuntsman*, *Big Data

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
If they have a problem managing memory, shouldn't there be an OOM? Why does AppClient throw an NPE? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das wrote: > Is that all you have in the executor logs? I

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread Cody Koeninger
I don't think deleting the checkpoint directory is a good way to restart the streaming job, you should stop the spark context or at the very least kill the driver process, then restart. On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy wrote: > Hi Cody, > > Our job is

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
OK. But, one thing that I observed is that when there is a problem with Kafka Stream, unless I delete the checkpoint directory the Streaming job does not restart. I guess it tries to retry the failed tasks and if it's not able to recover, it fails again. Sometimes, it fails with StackOverFlow

Re: status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
Further reading of the MasterPage code gave me what I want: http://:8080/json returns a JSON view of all the info presented on the main page On 9 November 2015 at 22:41, Igor Berman wrote: > Hi, > How do I get status of workers(slaves) from driver? > why I need it - I want to autoscale
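Igor's finding can be wrapped in a small polling helper. A plain-Python sketch — the "workers"/"state" field names are assumptions based on the standalone master's JSON output and may vary by Spark version:

```python
import json

def alive_workers(master_json):
    """Count workers reported as ALIVE in the master's /json payload.
    Field names ('workers', 'state') are assumed from the standalone
    master web UI and should be verified against your Spark version."""
    data = json.loads(master_json)
    return sum(1 for w in data.get("workers", []) if w.get("state") == "ALIVE")

# A hypothetical payload, for illustration:
sample = '{"workers": [{"state": "ALIVE"}, {"state": "DEAD"}, {"state": "ALIVE"}]}'
# alive_workers(sample) == 2
# In practice the payload would come from something like:
# urllib.request.urlopen("http://<master-host>:8080/json").read()
```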

status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
Hi, How do I get the status of workers (slaves) from the driver? Why I need it: I want to autoscale new workers and want to poll the status of the cluster (e.g. number of alive slaves connected) so that I'll submit the job only after the expected number of slaves has joined the cluster. I've found the MasterPage class that produces

Re: Protobuff 3.0 for Spark

2015-11-09 Thread Lan Jiang
I have not run into any linkage problem, but maybe I was lucky. :-). The reason I wanted to use protobuf 3 is mainly for Map type support. On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran wrote: > > > On 5 Nov 2015, at 00:12, Lan Jiang wrote: > > > > I

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Ted Yu
I backtracked to: ef362846eb448769bcf774fc9090a5013d459464 The issue was still there. FYI On Mon, Nov 9, 2015 at 10:46 AM, Ted Yu wrote: > Which branch did you perform the build with ? > > I used the following command yesterday: > mvn -Phive -Phive-thriftserver -Pyarn

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread Cody Koeninger
Without knowing more about what's being stored in your checkpoint directory / what the log output is, it's hard to say. But either way, just deleting the checkpoint directory probably isn't sufficient to restart the job... On Mon, Nov 9, 2015 at 2:40 PM, swetha kasireddy

Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread Luke Han
Some friends referred me to this thread about OLAP/Kylin and Spark... Here are my 2 cents. If you are trying to set up OLAP, Apache Kylin should be one good idea for you to evaluate. The project has been developed for more than 2 years and is going to graduate to an Apache Top Level Project [1]. There are many

Re: could not see the print out log in spark functions as mapPartitions

2015-11-09 Thread Zhiliang Zhu
Hi Ching-Mallete, I have found the log and the reason for that. Thanks a lot! Zhiliang On Tuesday, November 10, 2015 12:23 PM, Zhiliang Zhu wrote: Also for Spark UI, that is, log from other places could be found, but the log from the functions as

Re: java.lang.NoSuchMethodError: org.apache.spark.ui.SparkUI.addStaticHandler(Ljava/lang/String;Ljava/lang/String;

2015-11-09 Thread Hitoshi Ozawa
I think that example is included with Spark. The source code is included in examples/src/main/java/org/apache/spark/examples/sql It can be executed with the following command: ./bin/run-example org.apache.spark.examples.sql.JavaSparkSQL

Why is Kryo not the default serializer?

2015-11-09 Thread Hitoshi Ozawa
If Kryo usage is recommended, why is Java serialization the default serializer instead of Kryo? Is there some limitation to using Kryo? I've read through the documentation, but it just seems Kryo is a better choice and should be made the default.

Re: Why is Kryo not the default serializer?

2015-11-09 Thread Sabarish Sasidharan
I have seen some failures in our workloads with Kryo, one I remember is a scenario with very large arrays. We could not get Kryo to work despite using the different configuration properties. Switching to java serde was what worked. Regards Sab On Tue, Nov 10, 2015 at 11:43 AM, Hitoshi Ozawa

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-09 Thread أنس الليثي
If I package the application and submit it, it works fine, but I need to run it from eclipse. Is there any problem running the application from eclipse? On 9 November 2015 at 12:27, Tathagata Das wrote: > How are you submitting the spark application? > You are supposed

Batch Recovering from Checkpoint is taking longer runtime than usual

2015-11-09 Thread kundan kumar
Hi, Below is my code snippet where I am using the checkpointing feature of Spark Streaming. The SPARK_DURATION that I am using is 5 minutes and the batch duration is 15 minutes. I am checkpointing the data at each SPARK_DURATION (5 minutes). When I kill the job and start the next batch it takes

shapely + pyspark

2015-11-09 Thread ikeralbeniz
Hi All, I am using the Shapely library to get the intersection between two areas. The problem is that when I try to run it on PySpark I get a serializer error because Shapely depends on the GEOS C library. Is there any way to use this kind of library on Spark with PySpark? Regards.

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
I store some metrics and the RDD which is the output of updateStateByKey in my checkpoint directory. Will retest and check for the error that I get. But, it's mostly the StackOverFlowError that I get. So, increasing the Stack size might help? On Mon, Nov 9, 2015 at 12:45 PM, Cody Koeninger

Spark IndexedRDD dependency in Maven

2015-11-09 Thread swetha
Hi, What is the appropriate dependency to include for Spark IndexedRDD? I get a compilation error if I include 0.3 as the version as shown below: amplab spark-indexedrdd 0.3 Thanks, Swetha

Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread Andrés Ivaldi
Hi, I'm also considering something similar. Plain Spark is too slow for my case; a possible solution is to use Spark as a multiple-source connector and basic transformation layer, then persist the information (currently in an RDBMS). After that, with our engine, we build a kind of Cube queries, and the

Re: Overriding Derby in hive-site.xml giving strange results...

2015-11-09 Thread Michael Armbrust
We have two copies of hive running in order to support multiple versions of hive with a single version of Spark. You are seeing log messages for the version that we use for execution (it just creates a temporary derby metastore). On Mon, Nov 9, 2015 at 3:32 PM, mayurladwa

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Which Spark version is used? It was fixed in Parquet-1.7x, so Spark-1.5.x will work. > On Nov 9, 2015, at 3:43 PM, swetha wrote: > > Hi, > > I see unwanted Warning when I try to save a Parquet file in hdfs in Spark. > Please find below the code and the Warning

First project in scala IDE : first problem

2015-11-09 Thread didier vila
All, This is my first run with Scala and Maven on Spark using the Scala IDE on my single computer. I have the following problem. Thanks in advance. Didier Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/11/09 23:30:52 INFO SparkContext: Running Spark version

RE: First project in scala IDE : first problem

2015-11-09 Thread didier vila
All, I identified the reason for my problem. Regards. D From: viladid...@hotmail.com To: user@spark.apache.org Subject: First project in scala IDE : first problem Date: Mon, 9 Nov 2015 23:39:55 + All, This is my first run with scala and maven on spark using scala IDE on my single

spark shared RDD

2015-11-09 Thread Ben
Hi, After reading some documentations about spark and ignite, I am wondering if shared RDD from ignite can be used to share data in memory without any duplication between multiple spark jobs. Running on mesos I can collocate them, but will this be enough to avoid memory duplication or not? I am

Overriding Derby in hive-site.xml giving strange results...

2015-11-09 Thread mayurladwa
Hello, this question is around the hive thrift server that comes packaged with spark 1.5.1, I am trying to change the default metastore from derby. From googling I see the more commonly documented alternatives to derby are MySQL, but as it exposes a JDBC interface I want to try and get this
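For reference, overriding the Derby default usually means setting the standard JDO connection properties in hive-site.xml. A hedged sketch — the host, port, database name, and credentials below are all placeholders:

```xml
<!-- hive-site.xml: point the metastore at MySQL instead of embedded Derby.
     All values below are placeholders, not working settings. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```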

Clustering of Words

2015-11-09 Thread pradhandeep
Hi, I am trying to cluster words of some articles. I used TFIDF and Word2Vec in Spark to get the vector for each word and I used KMeans to cluster the words. Now, is there any way to get back the words from the vectors? I want to know what words are there in each cluster. I am aware that TFIDF
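One way to answer the question in the post: you don't need to invert the vectors — keep the word→vector dictionary and assign each word's vector to its nearest trained centroid. A pure-Python sketch, with a toy nearest-centroid function standing in for Spark's KMeansModel.predict:

```python
def nearest_centroid(vec, centroids):
    """Index of the closest centroid by squared Euclidean distance
    (a stand-in for KMeansModel.predict)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist2(vec, centroids[i]))

def words_per_cluster(word_vectors, centroids):
    """Group the original words by the cluster their vectors fall into."""
    clusters = {}
    for word, vec in word_vectors.items():
        clusters.setdefault(nearest_centroid(vec, centroids), []).append(word)
    return clusters

# Toy word vectors and centroids, for illustration only:
vecs = {"cat": [0.0, 0.1], "dog": [0.1, 0.0], "car": [5.0, 5.0]}
groups = words_per_cluster(vecs, [[0.0, 0.0], [5.0, 5.0]])
# groups == {0: ["cat", "dog"], 1: ["car"]}
```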

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Ted Yu
Please see https://issues.apache.org/jira/browse/PARQUET-124 > On Nov 8, 2015, at 11:43 PM, swetha wrote: > > Hi, > > I see unwanted Warning when I try to save a Parquet file in hdfs in Spark. > Please find below the code and the Warning message. Any idea as to how

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-09 Thread Tathagata Das
How are you submitting the spark application? You are supposed to submit the fat-jar of the application that include the spark-streaming-twitter dependency (and its subdeps) but not spark-streaming and spark-core. On Mon, Nov 9, 2015 at 1:02 AM, أنس الليثي wrote: > I

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-09 Thread أنس الليثي
I tried to remove Maven and add the dependencies manually using build path > configure build path > add external jars, then adding the jars manually, but it did not work. I tried to create another project and copied the code from the first app, but the problem is still the same. I even tried to

Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-09 Thread aecc
Any help on this? This is really blocking me and I don't find any feasible solution yet. Thanks.

Re: Unwanted SysOuts in Spark Parquet

2015-11-09 Thread Hitoshi Ozawa
I'm not sure if the following will work with Parquet output, but have you tried setting sc.setLogLevel("ERROR") or setting log levels in Spark's log4j.properties file?

Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread Ted Yu
Please consider using NoSQL engine such as hbase. Cheers > On Nov 9, 2015, at 3:03 PM, Andrés Ivaldi wrote: > > Hi, > I'm also considering something similar, Spark plain is too slow for my case, > a possible solution is use Spark as Multiple Source connector and basic >