Re: Some questions on Multiple Streams

2015-04-24 Thread Laeeq Ahmed
Hi, any comments please? Regards, Laeeq On Friday, April 17, 2015 11:37 AM, Laeeq Ahmed laeeqsp...@yahoo.com.INVALID wrote: Hi, I am working with multiple Kafka streams (23 streams) and currently I am processing them separately. I receive one stream from each topic. I have the

Spark Cluster Setup

2015-04-24 Thread James King
I'm trying to find out how to set up a resilient Spark cluster. Things I'm thinking about include: - How to start multiple masters on different hosts? - There isn't a conf/masters file from what I can see. Thank you.

Parquet error reading data that contains array of structs

2015-04-24 Thread Jianshi Huang
Hi, My data looks like this:
+----------+-----------+---------+
| col_name | data_type | comment |
+----------+-----------+---------+
| cust_id  | string    |         |
| part_num | int       |

Re: Some questions on Multiple Streams

2015-04-24 Thread Iulian Dragoș
It looks like you’re creating 23 actions in your job (one per DStream). As far as I know, by default Spark Streaming executes only one job at a time, so your 23 actions are executed one after the other. Try setting spark.streaming.concurrentJobs to something higher than one. iulian On Fri, Apr
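A minimal sketch of the setting Iulian describes, assuming the standard Scala streaming API (app name and values are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("multi-stream-app")               // illustrative name
      .set("spark.streaming.concurrentJobs", "23")  // default is 1; one slot per concurrent output action
    val ssc = new StreamingContext(conf, Seconds(10))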

Re: spark-ec2 s3a filesystem support and hadoop versions

2015-04-24 Thread Steve Loughran
S3a isn't ready for production use on anything below Hadoop 2.7.0. I say that as the person who mentored in all the patches for it between Hadoop 2.6 and 2.7; you need everything in https://issues.apache.org/jira/browse/HADOOP-11571 in your code - Hadoop 2.6.0 doesn't have any of the HADOOP-11571

RE: Tasks run only on one machine

2015-04-24 Thread Evo Eftimov
# of tasks = # of partitions, hence you can provide the desired number of partitions to the textFile API, which should result in a) a better spatial distribution of the RDD and b) each partition being operated upon by a separate task. You can provide the number of p -Original Message-
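A sketch of what Evo describes, using sc.textFile's optional minPartitions argument (path and value are placeholders):

    // ask for more partitions up front so more tasks can run in parallel;
    // Hadoop input splits may still produce more partitions than requested
    val lines = sc.textFile("hdfs:///data/input.txt", 64)
    println(s"partitions: ${lines.partitions.length}")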

Re: Some questions on Multiple Streams

2015-04-24 Thread Dibyendu Bhattacharya
You can probably try the Low Level Consumer from spark-packages (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer). How many partitions are there for your topics? Say you have 10 topics, each having 3 partitions; ideally you can create a max of 30 parallel Receivers and 30

RE: Slower performance when bigger memory?

2015-04-24 Thread Evo Eftimov
You can resort to Serialized storage (still in memory) of your RDDs - this will obviate the need for GC since the RDD elements are stored as serialized objects off the JVM heap (most likely in Tachyon, which is a distributed in-memory file system used by Spark internally). Also review the Object
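A hedged sketch of serialized in-memory caching; note that MEMORY_ONLY_SER keeps the serialized bytes on the JVM heap, while StorageLevel.OFF_HEAP is the Tachyon-backed level:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("serialized-cache")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // optional, more compact
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized form: far fewer live objects for the GC to trace
    rdd.count()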

Re: Understanding Spark/MLlib failures

2015-04-24 Thread Hoai-Thu Vuong
Hi Andrew, according to you, should we balance the time when GC runs with the batch time in which the RDD is processed? On Fri, Apr 24, 2015 at 6:58 AM Reza Zadeh r...@databricks.com wrote: Hi Andrew, The .principalComponents feature of RowMatrix is currently constrained to tall and skinny matrices.

Re: A Spark Group by is running forever

2015-04-24 Thread Iulian Dragoș
On Thu, Apr 23, 2015 at 6:09 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have seen multiple blogs stating to use reduceByKey instead of groupByKey. Could someone please help me in converting the below code to use reduceByKey? Code some spark processing ... Below val
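A generic sketch of the usual rewrite (the poster's actual code is not shown, so the pair RDD here is illustrative):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey + sum shuffles every individual value across the network:
    val viaGroup  = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines within each partition before shuffling:
    val viaReduce = pairs.reduceByKey(_ + _)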

what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread sequoiadb
If I run spark in stand-alone mode (not YARN mode), is there any tool like Sqoop that is able to transfer data from RDBMS to Spark storage? Thanks

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread ayan guha
What is the specific use case? I can think of a couple of ways (write to HDFS and then read from Spark, or stream data to Spark). Also I have seen people using MySQL jars to bring data in. Essentially you want to simulate creation of an RDD. On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com
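One hedged sketch of the "bring data in with the MySQL jar" option, assuming the Spark 1.3 JDBC data source (URL, credentials and table name are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // assumes an existing SparkContext sc
    val customers = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=u&password=p",
      "dbtable" -> "customers"))
    customers.registerTempTable("customers")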

Re: Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-24 Thread Hoai-Thu Vuong
I use sudo pip install ... on each machine in the cluster, and don't think about how to submit the library. On Fri, Apr 24, 2015 at 4:21 AM dusts66 dustin.davids...@gmail.com wrote: I am trying to figure out python library management. So my question is: Where do third party Python libraries(ex. numpy, scipy,

Re: JAVA_HOME problem

2015-04-24 Thread sourabh chaki
Yes Akhil. This is the same issue. I have updated my comment in that ticket. Thanks Sourabh On Fri, Apr 24, 2015 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Isn't this related to this https://issues.apache.org/jira/browse/SPARK-6681 Thanks Best Regards On Fri, Apr 24, 2015

Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Spico Florin
Hello! I know that HadoopRDD partitions are built based on the number of splits in HDFS. I'm wondering if these partitions preserve the initial order of data in the file. As an example, if I have an HDFS file (myTextFile) that has these splits: split 0 - line 1, ..., line k; split 1 - line k+1, ...,

Re: Slower performance when bigger memory?

2015-04-24 Thread Shawn Zheng
This is not about the GC issue itself. The memory is On Friday, April 24, 2015, Evo Eftimov evo.efti...@isecc.com wrote: You can resort to Serialized storage (still in memory) of your RDDs – this will obviate the need for GC since the RDD elements are stored as serialized objects off the JVM heap

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
zipWithIndex will preserve whatever order is there in your val lines. I am not sure whether val lines = sc.textFile("hdfs://mytextFile") maintains the order; if that line does, the next step will maintain it for sure. On 24 April 2015 at 18:35, Spico Florin spicoflo...@gmail.com wrote: Hello! I know that

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
I did a quick test as I was curious about it too. I created a file with numbers from 0 to 999, in order, line by line. Then I did: scala> val numbers = sc.textFile("./numbers.txt") scala> val zipped = numbers.zipWithUniqueId scala> zipped.foreach(i => println(i)) Expected result if the order was
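A cleaned-up spark-shell version of that quick test, plus the zipWithIndex variant the thread converges on later (file path is illustrative):

    val numbers = sc.textFile("./numbers.txt")

    // zipWithUniqueId: ids are unique but assigned per partition
    // (k, n+k, 2n+k, ... in the k-th of n partitions), so they don't follow line order
    numbers.zipWithUniqueId().take(5).foreach(println)

    // zipWithIndex: consecutive 0, 1, 2, ... following the order of elements
    // within and across partitions
    numbers.zipWithIndex().take(5).foreach(println)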

Spark RDD sortByKey triggering a new job

2015-04-24 Thread Spico Florin
I have tested the sortByKey method with the following code and I have observed that it triggers a new job when called. I could not find this documented in either the API or the code. Is this intended behavior? For example, the RDD zipWithIndex method's API specifies that it will trigger a new job. But

Re: Spark RDD sortByKey triggering a new job

2015-04-24 Thread Sean Owen
Yes, I think this is a known issue, that sortByKey actually runs a job to assess the distribution of the data. https://issues.apache.org/jira/browse/SPARK-1021 I think further eyes on it would be welcome as it's not desirable. On Fri, Apr 24, 2015 at 9:57 AM, Spico Florin spicoflo...@gmail.com

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Ted Yu
Yin: The Fix Version of SPARK-4520 is not set. I assume it was fixed in 1.3.0. Cheers On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai yh...@databricks.com wrote: The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark?

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sergio Jiménez Barrio
Sorry for my explanation, my English is bad. I just need to obtain the Long contained in the DStream created by messages.count(). Thanks for all. 2015-04-24 20:00 GMT+02:00 Sean Owen so...@cloudera.com: Do you mean an RDD? I don't think it makes sense to ask if the DStream has data; it may have

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
Oh, I missed that. It is fixed in 1.3.0. Also, Jianshi, the dataset was not generated by Spark SQL, right? On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu yuzhih...@gmail.com wrote: Yin: The Fix Version of SPARK-4520 is not set. I assume it was fixed in 1.3.0. Cheers On Fri, Apr 24,

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sergio Jiménez Barrio
But if I use messages.count().print, this shows a single number :/ 2015-04-24 20:22 GMT+02:00 Sean Owen so...@cloudera.com: It's not a Long; it's an infinite stream of Longs. On Fri, Apr 24, 2015 at 2:20 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: It isn't the sum. This is de

Re: Spark Cluster Setup

2015-04-24 Thread Dean Wampler
It's mostly manual. You could try automating with something like Chef, of course, but there's nothing already available in terms of automation. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com

Re: How to debug Spark on Yarn?

2015-04-24 Thread Marcelo Vanzin
On top of what's been said... On Wed, Apr 22, 2015 at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: 1) I can go to Spark UI and see the status of the APP but cannot see the logs as the job progresses. How can I see logs of executors as they progress? Spark 1.3 should have links to the

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
The sum? You just need to use an accumulator to sum the counts or something. On Fri, Apr 24, 2015 at 2:14 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Sorry for my explanation, my English is bad. I just need obtain the Long containing of the DStream created by messages.count().

Re: How to debug Spark on Yarn?

2015-04-24 Thread Sven Krasser
For #1, click on a worker node on the YARN dashboard. From there, Tools > Local logs > Userlogs has the logs for each application, and you can view them by executor even while an application is running. (This is for Hadoop 2.4, things may have changed in 2.6.) -Sven On Thu, Apr 23, 2015 at 6:27 AM,

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
No, it prints each Long in that stream, forever. Have a look at the DStream API. On Fri, Apr 24, 2015 at 2:24 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: But if a use messages.count().print this show a single number :/

Re: How to debug Spark on Yarn?

2015-04-24 Thread Sven Krasser
On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com wrote: Spark 1.3 should have links to the executor logs in the UI while the application is running. Not yet in the history server, though. You're absolutely correct -- didn't notice it until now. This is a great addition!

Disable partition discovery

2015-04-24 Thread cosmincatalin
How can one disable Partition discovery in Spark 1.3.0 when using sqlContext.parquetFile? Alternatively, is there a way to load .parquet files without Partition discovery? - https://www.linkedin.com/in/cosmincatalinsanda

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Sean Owen
The order of elements in an RDD is in general not guaranteed unless you sort. You shouldn't expect to encounter the partitions of an RDD in any particular order. In practice, you probably find the partitions come up in the order Hadoop presents them in this case. And within a partition, in this

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
If you're reading a file line by line then you should simply use Hadoop's FileSystem class from Java to read the file with a BufferedInputStream. I don't think you need an RDD here. -Original Message- From: Michal Michalski

Re: Convert DStream to DataFrame

2015-04-24 Thread Yin Huai
Hi Sergio, I missed this thread somehow... For the error "case classes cannot have more than 22 parameters", it is a limitation of Scala (see https://issues.scala-lang.org/browse/SI-7296). You can follow the instructions at

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
Thanks that's why I was worried and tested my application again :). On 24 April 2015 at 23:22, Michal Michalski michal.michal...@boxever.com wrote: Yes. Kind regards, Michał Michalski, michal.michal...@boxever.com On 24 April 2015 at 17:12, Jeetendra Gangele gangele...@gmail.com wrote:

Re: regarding ZipWithIndex

2015-04-24 Thread Jeetendra Gangele
Can anyone guide me how to reduce the size from Long to Int, since I don't need a Long index? I have huge data and this index is taking 8 bytes; if I can reduce it to 4 bytes it will be a great help. On 22 April 2015 at 22:46, Jeetendra Gangele gangele...@gmail.com wrote: Sure thanks. if you can guide
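A hedged sketch: zipWithIndex always yields Long indices, but they can be narrowed immediately if the element count is known to fit in an Int:

    import org.apache.spark.rdd.RDD

    val data = sc.parallelize(Seq("a", "b", "c"))
    // only safe while the RDD has fewer than Int.MaxValue elements
    val indexed: RDD[(String, Int)] = data.zipWithIndex().map { case (v, i) => (v, i.toInt) }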

contributing code - how to test

2015-04-24 Thread Deborah Siegel
Hi, I selected a starter task in JIRA, and made changes to my github fork of the current code. I assumed I would be able to build and test. % mvn clean compile was fine but % mvn package failed [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test)

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
Hi Reza, I’m trying to identify groups of similar variables, with the ultimate goal of reducing the dimensionality of the dataset. I believe SVD would be sufficient for this, although I also tried running RowMatrix.computeSVD and observed the same behavior: frequent task failures, with

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
Hi Burak, Thanks for this insight. I’m curious to know, how did you reach the conclusion that GC pauses were to blame? I’d like to gather some more diagnostic information to determine whether or not I’m facing a similar scenario. ~ Andrew From: Burak Yavuz [mailto:brk...@gmail.com] Sent:

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread Ali Bajwa
Any ideas on this? Any sample code to join 2 data frames on two columns? Thanks Ali On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote: Hi experts, Sorry if this is a n00b question or has already been answered... Am trying to use the data frames API in python to join 2

Re: Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Use Object[] in Java just works :). On Fri, Apr 24, 2015 at 4:56 PM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java by using a List? It looks like ArrayList<Object> something; Row.create(something) will create a row

Re: contributing code - how to test

2015-04-24 Thread Sean Owen
The standard incantation -- which is a little different from standard Maven practice -- is: mvn -DskipTests [your options] clean package mvn [your options] test Some tests require the assembly, so you have to do it this way. I don't know what the test failures were, you didn't post them, but

RE: indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
Hi, I may need to read many values. The list [0,4,5,6,8] is the locations of the rows I'd like to extract from the RDD (of LabeledPoints). Could you possibly provide a quick example? Also, I'm not quite sure how this works, but the resulting RDD should be a clone, as I may need to modify the

Re: StreamingContext.textFileStream issue

2015-04-24 Thread Prannoy
Try putting files with different file names and see if the stream is able to detect them. On 25-Apr-2015 3:02 am, Yang Lei [via Apache Spark User List] ml-node+s1001560n22650...@n3.nabble.com wrote: I hit the same issue as if the directory has no files at all when running the sample

Re: Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-24 Thread Christian Perez
To run MLlib, you only need numpy on each node. For additional dependencies, you can call the spark-submit with --py-files option and add the .zip or .egg. https://spark.apache.org/docs/latest/submitting-applications.html Cheers, Christian On Fri, Apr 24, 2015 at 1:56 AM, Hoai-Thu Vuong

ORCFiles

2015-04-24 Thread David Mitchell
Does anyone know in which version of Spark there will be support for ORCFiles via spark.sql.hive? Will it be in 1.4? David

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-04-24 Thread Peng Cheng
I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120G RAM in total); the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (thus requiring repartitioning) and each row has around 100k of data after serialization. The job

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
The problem I'm facing is that I need to process lines from the input file in the order they're stored in the file, as they define the order of updates I need to apply on some data, and these updates are not commutative, so the order matters. Unfortunately the input is purely order-based, there's no

[Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-24 Thread Peter Rudenko
Hi, I have the next problem. I have a dataset with 30 columns (15 numeric, 15 categorical) and I am using ml transformers/estimators to transform each column (StringIndexer for categorical, MeanImputor for numeric). This creates 30 more columns in the dataframe. After, I’m using VectorAssembler to
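A hedged sketch of the pipeline shape described, using spark.ml's StringIndexer and VectorAssembler (column names are illustrative; MeanImputor is the poster's own transformer and is omitted):

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    val categoricalCols = Seq("cat1", "cat2")
    val numericCols     = Seq("num1", "num2")

    // one StringIndexer per categorical column, producing <name>_idx columns
    val indexers = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
    }
    // assemble numeric and indexed columns into a single feature vector
    val assembler = new VectorAssembler()
      .setInputCols((numericCols ++ categoricalCols.map(_ + "_idx")).toArray)
      .setOutputCol("features")

    val stages: Array[PipelineStage] = (indexers :+ assembler).toArray
    val pipeline  = new Pipeline().setStages(stages)
    val assembled = pipeline.fit(df).transform(df)   // df is the 30-column DataFrame from the post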

Re: Spark Cluster Setup

2015-04-24 Thread James King
Thanks Dean. Sure, I have that set up locally and am testing it with ZK. But to start my multiple Masters, do I need to go to each host and start them there, or is there a better way to do this? Regards, jk On Fri, Apr 24, 2015 at 5:23 PM, Dean Wampler deanwamp...@gmail.com wrote: The convention for

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
You used zipWithUniqueId? On 24 April 2015 at 21:28, Michal Michalski michal.michal...@boxever.com wrote: I somehow missed zipWithIndex (and Sean's email), thanks for hint. I mean - I saw it before, but I just thought it's not doing what I want. I've re-read the description now and it looks

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Imran Rashid
Another issue is that HadoopRDD (which sc.textFile uses) might split input files, and even if it doesn't split, it doesn't guarantee that part-file numbers go to the corresponding partition number in the RDD. E.g. part-0 could go to partition 27. On Apr 24, 2015 7:41 AM, Michal Michalski

Re: Spark Cluster Setup

2015-04-24 Thread Dean Wampler
The convention for standalone cluster is to use Zookeeper to manage master failover. http://spark.apache.org/docs/latest/spark-standalone.html Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com
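For reference, a hedged sketch of the recovery settings from that page, set via spark-env.sh on every Master (hosts and paths are placeholders):

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"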

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
I somehow missed zipWithIndex (and Sean's email), thanks for the hint. I mean - I saw it before, but I just thought it wasn't doing what I want. I've re-read the description now and it looks like it might actually be what I need. Thanks. Kind regards, Michał Michalski, michal.michal...@boxever.com On

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
I'd prefer to avoid preparing the file in advance by adding ordinals before / after each line. I mean - I want to avoid doing it outside of Spark, of course. That's why I want to achieve the same effect with Spark by reading the file as a single partition and zipping it with a unique id, which - I hope

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
I have an RDD[Object] which I get from an HBase scan using newAPIHadoopRDD. I am running zipWithIndex here and it's preserving the order: the first object got 1, the second got 2, the third got 3, and so on; the nth object got n. On 24 April 2015 at 20:56, Ganelin, Ilya ilya.gane...@capitalone.com wrote: To maintain

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
Of course after you do it, you probably want to call repartition(somevalue) on your RDD to get your parallelism back. Kind regards, Michał Michalski, michal.michal...@boxever.com On 24 April 2015 at 15:28, Michal Michalski michal.michal...@boxever.com wrote: I did a quick test as I was curious

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
Michal - you need to sort your RDD. Check out the shuffle documentation in the Spark Programming Guide. It talks about this specifically. You can resolve this in a couple of ways - either by collecting your RDD and sorting it, using sortBy, or not worrying about the internal ordering. You can
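A hedged sketch of the sortBy route: attach an explicit ordinal to each line and sort on it, so downstream ordering no longer depends on how partitions happen to be enumerated (path is illustrative):

    val lines = sc.textFile("hdfs:///myTextFile")
    val ordered = lines.zipWithIndex()            // (line, ordinal in input order)
      .sortBy { case (_, idx) => idx }            // global sort on the captured ordinal
      .map { case (line, _) => line }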

Re: ORCFiles

2015-04-24 Thread Ted Yu
Please see SPARK-2883. There is no Fix Version yet. On Fri, Apr 24, 2015 at 5:45 PM, David Mitchell jdavidmitch...@gmail.com wrote: Does anyone know in which version of Spark will there be support for ORCFiles via spark.sql.hive? Will it be in 1.4? David

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
Can you give an example set of data and the desired output? On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wenlei@gmail.com wrote: Hi, I would like to answer the following customized aggregation query on Spark SQL 1. Group the table by the value of Name 2. For each group, choose the tuple with

Re: Number of input partitions in SparkContext.sequenceFile

2015-04-24 Thread Wenlei Xie
Hi, I checked the number of partitions by System.out.println("INFO: RDD with " + rdd.partitions().size() + " partitions created."); Each single split is about 100MB. I am currently loading the data from the local file system; would this explain this observation? Thank you! Best, Wenlei On Tue, Apr

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
I just tested your pr On 25 Apr 2015 10:18, Ali Bajwa ali.ba...@gmail.com wrote: Any ideas on this? Any sample code to join 2 data frames on two columns? Thanks Ali On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote: Hi experts, Sorry if this is a n00b question or has

Re: sparksql - HiveConf not found during task deserialization

2015-04-24 Thread Manku Timma
Setting SPARK_CLASSPATH is triggering other errors. Not working. On 25 April 2015 at 09:16, Manku Timma manku.tim...@gmail.com wrote: Actually found the culprit. The JavaSerializerInstance.deserialize is called with a classloader (of type MutableURLClassLoader) which has access to all the

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
Here you go: t = [["A",10,"A10"],["A",20,"A20"],["A",30,"A30"],["B",15,"B15"],["C",10,"C10"],["C",20,"C200"]] TRDD = sc.parallelize(t).map(lambda t: Row(name=str(t[0]),age=int(t[1]),other=str(t[2]))) TDF = ssc.createDataFrame(TRDD) print TDF.printSchema() TDF.registerTempTable("tab") JN =

Re: sparksql - HiveConf not found during task deserialization

2015-04-24 Thread Manku Timma
Actually found the culprit. JavaSerializerInstance.deserialize is called with a classloader (of type MutableURLClassLoader) which has access to all the hive classes. However, internally it triggers a call to loadClass with the default classloader. Below is the stacktrace (line numbers in the

Customized Aggregation Query on Spark SQL

2015-04-24 Thread Wenlei Xie
Hi, I would like to answer the following customized aggregation query on Spark SQL: 1. Group the table by the value of Name. 2. For each group, choose the tuple with the max value of Age (the ages are distinct for every name). I am wondering what's the best way to do it in Spark SQL? Should I use
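One hedged way to express this in Spark SQL (a Scala sketch; ayan guha's reply elsewhere in this thread shows a PySpark variant), using a join against the grouped max:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Int, other: String)   // illustrative schema
    val df = sc.parallelize(Seq(
      Person("A", 10, "A10"), Person("A", 20, "A20"), Person("B", 15, "B15"))).toDF()
    df.registerTempTable("tab")

    // per-name maximum age, then join back to pick the full tuple
    sqlContext.sql("SELECT name, MAX(age) AS max_age FROM tab GROUP BY name")
      .registerTempTable("max_ages")
    val result = sqlContext.sql(
      "SELECT t.name, t.age, t.other FROM tab t JOIN max_ages m ON t.name = m.name AND t.age = m.max_age")
    result.show()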

Re: Convert DStream to DataFrame

2015-04-24 Thread Sergio Jiménez Barrio
Solved! I have solved the problem by combining both solutions. The result is this: messages.foreachRDD { rdd => val message: RDD[String] = rdd.map { y => y._2 } val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext) import
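A hedged reconstruction of that pattern in full (SQLContextSingleton is the helper from the Spark Streaming guide; the case class and table name are illustrative):

    case class Message(text: String)

    messages.foreachRDD { rdd =>
      val values = rdd.map { case (_, v) => v }   // keep only the Kafka message body
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._

      val df = values.map(Message(_)).toDF()
      df.registerTempTable("kafka_messages")
      // e.g. sqlContext.sql("SELECT count(*) FROM kafka_messages").show()
    }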

Convert DStream[Long] to Long

2015-04-24 Thread Sergio Jiménez Barrio
Hi, I need to compare the count of messages received, whether it is 0 or not, but messages.count() returns a DStream[Long]. I tried this solution: val cuenta = messages.count().foreachRDD { rdd => rdd.first() } But

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
Yes. Kind regards, Michał Michalski, michal.michal...@boxever.com On 24 April 2015 at 17:12, Jeetendra Gangele gangele...@gmail.com wrote: you used ZipWithUniqueID? On 24 April 2015 at 21:28, Michal Michalski michal.michal...@boxever.com wrote: I somehow missed zipWithIndex (and Sean's

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
foreachRDD is an action and doesn't return anything. It seems like you want one final count, but that's not possible with a stream, since there is conceptually no end to a stream of data. You can get a stream of counts, which is what you have already. You can sum those counts in another data
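A hedged sketch of that suggestion with an accumulator (assumes an existing StreamingContext ssc and the messages DStream from the thread):

    // running total of all batch counts, kept on the driver
    val total = ssc.sparkContext.accumulator(0L, "total messages")

    messages.count().foreachRDD { rdd =>
      rdd.foreach(c => total += c)   // each batch produces a single Long in this RDD
    }
    // total.value can then be read on the driver, e.g. periodically or after ssc.stop()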

Re: Slower performance when bigger memory?

2015-04-24 Thread Sven Krasser
FWIW, I ran into a similar issue on r3.8xlarge nodes and opted for more/smaller executors. Another observation was that one large executor results in less overall read throughput from S3 (using Amazon's EMRFS implementation) in case that matters to your application. -Sven On Thu, Apr 23, 2015 at

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark? On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, My data looks like this: +---++--+ |

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
To maintain the order you can use zipWithIndex, as Sean Owen pointed out. This is the same as zipWithUniqueId except that the assigned number is the index of the data in the RDD, which I believe matches the order of the data as it's stored on HDFS. -Original

Re: Some questions on Multiple Streams

2015-04-24 Thread Iulian Dragoș
On Fri, Apr 24, 2015 at 4:56 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Thanks Dragos, Earlier test shows spark.streaming.concurrentJobs has worked. Glad to hear it worked! iulian Regards, Laeeq On Friday, April 24, 2015 11:58 AM, Iulian Dragoș iulian.dra...@typesafe.com

tachyon on machines launched with spark-ec2 scripts

2015-04-24 Thread Daniel Mahler
I have a cluster launched with spark-ec2. I can see a TachyonMaster process running, but I do not seem to be able to use Tachyon from the spark-shell. If I try rdd.saveAsTextFile("tachyon://localhost:19998/path") I get 15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0 (TID

Re: Non-Deterministic Graph Building

2015-04-24 Thread hokiegeek2
Hi Everyone, Here's the Scala code for generating the EdgeRDD, VertexRDD, and Graph: //Generate a mapping of vertex (edge) names to VertexIds val vertexNameToIdRDD = rawEdgeRDD.flatMap(x => Seq(x._1.src, x._1.dst)).distinct.zipWithUniqueId.cache //Generate VertexRDD with vertex data (in my case,

Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Hi, I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java by using a List? It looks like ArrayList<Object> something; Row.create(something) will create a row with a single column (and the single column contains the array). Best, Wenlei

indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices? For example, with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with elements 0, 4, 5, 6 and 8.

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-24 Thread N B
Hi TD, That little experiment helped a bit. This time we did not see any exceptions for about 16 hours but eventually it did throw the same exceptions as before. The cleaning of the shuffle files also stopped much before these exceptions happened - about 7-1/2 hours after startup. I am not quite

Re: JAVA_HOME problem

2015-04-24 Thread sourabh chaki
I am also facing the same problem with spark 1.3.0 and yarn-client and yarn-cluster mode. Launching yarn container failed and this is the error in stderr: Container: container_1429709079342_65869_01_01

Re: problem writing to s3

2015-04-24 Thread Akhil Das
You should probably open a JIRA issue with this, I think. Thanks Best Regards On Fri, Apr 24, 2015 at 3:27 AM, Daniel Mahler dmah...@gmail.com wrote: Hi Akhil I can confirm that the problem goes away when jsonRaw and jsonClean are in different s3 buckets. thanks Daniel On Thu, Apr 23,

DAG

2015-04-24 Thread Giovanni Paolo Gibilisco
Hi, I would like to know if it is possible to build the DAG before actually executing the application. My guess is that in the scheduler the DAG is built dynamically at runtime since it might depend on the data, but I was wondering if there is a way (and maybe a tool already) to analyze the code

Spark Internal Job Scheduling

2015-04-24 Thread Arpit1286
I came across the feature in Spark that allows you to schedule different tasks within a Spark context. I want to implement this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], subsequently make a composite key-value RDD [(K1,K2),V] and a

Spark on Mesos

2015-04-24 Thread Stephen Carman
aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache

Re: StreamingContext.textFileStream issue

2015-04-24 Thread Yang Lei
I hit the same issue - as if the directory has no files at all - when running the sample examples/src/main/python/streaming/hdfs_wordcount.py with a local directory and adding a file into that directory. I appreciate comments on how to resolve this.

Re: Spark on Mesos

2015-04-24 Thread Tim Chen
failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler

Re: Spark on Mesos

2015-04-24 Thread Yang Lei
failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler

Re: indexing an RDD [Python]

2015-04-24 Thread Sven Krasser
The solution depends largely on your use case. I assume the index is in the key. In that case, you can make a second RDD out of the list of indices and then use cogroup() on both. If the list of indices is small, just using filter() will work well. If you need to read back a few select values to
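The question is about PySpark, but a hedged Scala sketch of the "small list of indices" case looks like this (labeledPoints stands for the poster's RDD):

    // index the RDD, then keep only the rows whose index is in the broadcast set
    val wanted = sc.broadcast(Set(0L, 4L, 5L, 6L, 8L))

    val subset = labeledPoints.zipWithIndex()
      .filter { case (_, idx) => wanted.value.contains(idx) }
      .map { case (point, _) => point }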