Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Hello, In the end Hive is not a solution, as I cannot update the data. And for an archive file I think it would be the same issue. Any other solutions? Nicolas - Original Message - From: nib...@free.fr To: "Brett Antonides" Cc: user@spark.apache.org Sent: Friday, October 2

Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Hello, So, is Hive a solution for my need? - I receive small messages (10KB) identified by an ID (a product ID, for example) - Each message I receive is the latest picture of its product ID, so basically I just want to store the latest picture of each product inside HDFS in order to run batch processing on it later.
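
A minimal sketch of the "keep only the latest message per product" step described above, assuming the messages land as an RDD of (productId, (timestamp, payload)) pairs; all names and the output path are hypothetical:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical input: productId -> (arrival timestamp, raw message bytes).
def keepLatest(messages: RDD[(String, (Long, Array[Byte]))]): RDD[(String, (Long, Array[Byte]))] =
  // For each product ID, keep only the message with the newest timestamp.
  messages.reduceByKey((a, b) => if (a._1 >= b._1) a else b)
```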

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ted Yu
bq. val dist = sc.parallelize(l) Following the above, can you call, e.g., count() on dist before saving? Cheers On Fri, Oct 2, 2015 at 1:21 AM, jarias wrote: > Dear list, > > I'm experiencing a problem when trying to write any RDD to HDFS. I've > tried > with minimal
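
To make Ted's suggestion concrete, a sketch (`l` is the local collection from the original post; the output path is hypothetical):

```scala
val dist = sc.parallelize(l)
// Force an action first: if this prints 0, the problem is upstream of HDFS.
println(s"count = ${dist.count()}")
dist.saveAsTextFile("hdfs:///tmp/dist-output") // hypothetical path
```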

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
If you use transactional tables in Hive together with insert, update, and delete, then it does the "concatenate" for you automatically at regular intervals. Currently this works only with tables in ORC format (stored as ORC). On Sat, Oct 3, 2015 at 11:45, wrote: > Hello, > So,
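
For reference, a sketch of the kind of table definition Jörn is describing, issued here through a HiveContext (this assumes Hive 0.14+ with ACID enabled on the metastore side; table and column names are hypothetical):

```scala
// Transactional Hive tables must be bucketed and stored as ORC.
hiveContext.sql("""
  CREATE TABLE product_messages (product_id STRING, payload STRING)
  CLUSTERED BY (product_id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true')
""")
```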

How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which use a group by query fired from hiveContext.sql(). Now I know group by is evil, but in my use case I can't avoid it; I have around 7-8 fields on which I need to group by. Also I am using df1.except(df2), which also seems a heavy operation, and
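
For context, a sketch of the pattern being described (table, column, and DataFrame names are hypothetical):

```scala
// Group by fired through hiveContext.sql(), as in the original jobs.
val grouped = hiveContext.sql(
  "SELECT f1, f2, f3, count(*) AS cnt FROM my_table GROUP BY f1, f2, f3")

// The other heavy operation mentioned: rows of df1 not present in df2.
val diff = df1.except(df2)
```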

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
Another alternative is HBase with Phoenix as the SQL layer on top. On Sat, Oct 3, 2015 at 11:45, wrote: > Hello, > So, is Hive a solution for my need? > - I receive small messages (10KB) identified by an ID (a product ID, for example) > - Each message I receive is the last

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Alex Rovner
This sounds like you need to increase YARN overhead settings with the "spark.yarn.executor.memoryOverhead" parameter. See http://spark.apache.org/docs/latest/running-on-yarn.html for more information on the setting. If that does not work for you, please provide the error messages and the command
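
A sketch of setting the parameter Alex mentions (the value is an assumption and must be tuned to your containers; it can equally be passed as `--conf` on spark-submit):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Off-heap headroom per executor, in MB, on top of the executor heap.
  .set("spark.yarn.executor.memoryOverhead", "2048")
```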

Re: automatic start of streaming job on failure on YARN

2015-10-03 Thread Jeetendra Gangele
Yes, in yarn cluster mode. On 2 October 2015 at 22:10, Ashish Rangole wrote: > Are you running the job in yarn cluster mode? > On Oct 1, 2015 6:30 AM, "Jeetendra Gangele" wrote: > >> We have a streaming application running on YARN and we would like to

Re: Contribution in Apache Spark

2015-10-03 Thread Ted Yu
Please post more of your code snippet and the complete error. See also python/pyspark/tests.py for examples. Cheers On Fri, Oct 2, 2015 at 11:56 PM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > While typing the following line into the Hortonworks terminal, I'm getting *Syntax > Error: invalid

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Umesh Kacha
Hi Alex, thanks much for the reply. Please read the following for more details about my problem: http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn Each of my containers has 8 cores and 30 GB max memory. So I am using yarn-client mode with 40 executors with 27GB/2 cores each. If

Re: Hive ORC Malformed while loading into spark data frame

2015-10-03 Thread Umesh Kacha
Hi Zang, any idea why this is happening? I can load ORC files created by a Hive table, but I can't load ORC files created by Spark itself. It looks like a bug. On Wed, Sep 30, 2015 at 12:03 PM, Umesh Kacha wrote: > Hi Zang, thanks much, please find the code below > > Working code

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
Yes, the most recent version, or you can use Phoenix on top of HBase. I recommend trying out both and seeing which one is the most suitable. On Sat, Oct 3, 2015 at 13:13, nibiau wrote: > Hello, > Thanks. If I understand correctly, Hive can be usable in my context? > > Nicolas

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Alex Rovner
Can you send over your YARN logs along with the command you are using to submit your job? Alex Rovner, Director, Data Engineering, o: 646.759.0052 On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha wrote: > Hi Alex, thanks much for the reply.

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
You can update data in Hive if you use the ORC format. On Sat, Oct 3, 2015 at 10:42, wrote: > Hello, > In the end Hive is not a solution, as I cannot update the data. > And for an archive file I think it would be the same issue. > Any other solutions? > > Nicolas > > - Original

Can we using Spark Streaming to stream data from Hive table partitions?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which read Hive table partition data and process it independently in different threads in a driver. Now the data to process is huge, in terms of TB, and my jobs are not scaling and run slowly. So I am thinking of using Spark Streaming as and when data is added into Hive

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Umesh Kacha
Hi, thanks. I can't share YARN logs because of privacy at my company, but I can tell you I have looked at the YARN logs and found nothing except YARN killing containers because they exceed physical memory capacity. I am using the following command line script. The above job launches around 1500

RE : Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Hello, Thanks. If I understand correctly, Hive can be usable in my context? Nicolas Sent from my Samsung mobile device. Jörn Franke wrote: If you use transactional tables in Hive together with insert, update, and delete, then it does the "concatenate" for you

Re: Kafka Direct Stream

2015-10-03 Thread varun sharma
Thanks Gerard, the code snippet you shared worked, but can you please explain/point me to the usage of *collect* here? How is it different (performance/readability) from *filter*? > val filteredRdd = rdd.filter(x => x._1 == topic).map(_._2) I am doing something like this. Please tell me if I can

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Thanks a lot. Why did you say "the most recent version"? - Original Message - From: "Jörn Franke" To: "nibiau" Cc: banto...@gmail.com, user@spark.apache.org Sent: Saturday, October 3, 2015 13:56:43 Subject: Re: RE : Re: HDFS small file generation problem Yes

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Jacinto Arias
Yes, printing the result with collect or take works; actually this is a minimal example, but also when working with real data the actions are performed, and the resulting RDDs can be printed out without problem. The data is there and the operations are correct; they just cannot be written

Re: Kafka Direct Stream

2015-10-03 Thread Gerard Maas
Hi, collect(partialFunction) is equivalent to filter(x => partialFunction.isDefinedAt(x)).map(partialFunction), so it's functionally equivalent to your expression. I favor collect for its more compact form, but that's a personal preference. Use what you feel reads best. Regarding performance,
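
A sketch of the two equivalent forms being compared (assuming an RDD of (topic, message) pairs and a `topic` string, as in Varun's snippet):

```scala
// collect with a partial function: filter and map expressed in one pass.
val viaCollect = rdd.collect {
  case (t, message) if t == topic => message
}

// The explicit filter + map form from the original snippet.
val viaFilterMap = rdd.filter(x => x._1 == topic).map(_._2)
```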

Re: spark-submit --packages using different resolver

2015-10-03 Thread Burak Yavuz
Hi Jerry, The --packages feature doesn't support private repositories right now. However, in the case of S3, it might work. Could you please try using the --repositories flag and provide the address: `$ spark-submit --packages my:awesome:package --repositories

Q: optimal way to calculate aggregates on a stream

2015-10-03 Thread igor
Hello, I'm new to Spark, so sorry if my question looks dumb. I have a problem which I hope to solve using Spark. Here is a short description: 1. I have a simple flow of 600k tuples per minute. Each tuple is a structured metric name and its value: (a.b.c.d, value) (a.b.x, value)
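
The description is truncated, but as a starting point, a sketch of windowed aggregation over such a flow with Spark Streaming (assuming the metrics arrive as a DStream of (name, value) pairs; all names and durations are hypothetical):

```scala
import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical input stream: metric name -> value.
def perMinuteSums(metrics: DStream[(String, Double)]): DStream[(String, Double)] =
  // Sum each metric over a sliding 1-minute window, recomputed every 10 seconds.
  metrics.reduceByKeyAndWindow(_ + _, Minutes(1), Seconds(10))
```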

WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,...

2015-10-03 Thread Jacek Laskowski
Hi, The following WARN happens in Spark built from today's sources. There were some recent changes around RPC, so it may be related. Should I report an issue in JIRA? I use an sbt project and `sbt console` to play with Spark. scala> import org.apache.spark.SparkConf import org.apache.spark.SparkConf

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
Hive was originally not designed for updates, because it was purely warehouse-focused; the most recent version can do updates, deletes, etc. in a transactional way. However, you may also use HBase with Phoenix for that, depending on your other functional and non-functional requirements. On Sat, Oct 3,

Re: performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Michael Armbrust
Underneath the covers, the thrift server is just calling hiveContext.sql(...), so this is surprising. Maybe running EXPLAIN or EXPLAIN
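
The reply is cut off, but the suggestion can be sketched like this, using the query from Jeff's message, to compare the plans the two paths produce:

```scala
// Run this in the shell and the same EXPLAIN via beeline against the
// thrift server, then diff the two plans.
hiveContext.sql("EXPLAIN SELECT * FROM my_table WHERE id = '12345'")
  .collect()
  .foreach(println)
```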

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ajay Chander
Hi Jacinto, if I were you, the first thing I would do is write a sample Java application to write data into HDFS and see if it's working fine. Metadata is being created in HDFS; that means communication to the namenode is working fine but not to the datanodes, since you don't see any data inside the
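
Ajay suggests a sample Java application; here is the same sanity check sketched in Scala against the Hadoop FileSystem API (the path is hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// If this write fails or produces an empty file, the problem is between
// the client and the datanodes, not in Spark.
val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/hdfs-write-test.txt"))
out.writeBytes("hello hdfs\n")
out.close()
```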

performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Jeff Thompson
Hi, I'm running a simple SQL query over a ~700 million row table of the form: SELECT * FROM my_table WHERE id = '12345'; When I submit the query via beeline & the JDBC thrift server it returns in 35s. When I submit the exact same query using Spark SQL from a pyspark shell (sqlContext.sql("SELECT *
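
A hedged sketch of one way to time the second path for comparison (shown in Scala rather than pyspark; the query is the one from the message):

```scala
// collect() forces the full round trip, comparable to fetching results over JDBC.
val t0 = System.nanoTime()
val rows = sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'").collect()
println(s"${rows.length} rows in ${(System.nanoTime() - t0) / 1e9} s")
```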

How to make sense of Spark log entries

2015-10-03 Thread jeff saremi
There are executor logs and driver logs. Most of them are not intuitive enough to mean anything to us. Are there any notes, documents, or talks on how to decipher these logs and troubleshoot our applications' performance as a result? Thanks, Jeff

Re: laziness in textFile reading from HDFS?

2015-10-03 Thread Matt Narrell
Is there any more information, or are there best practices, here? I have the exact same issue when reading large data sets from HDFS (larger than available RAM): I cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER and using nearly all the cluster resources. Should I
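
For reference, the workaround being described, sketched (the input path is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

// Serialize partitions and spill to disk when the data set exceeds RAM.
val data = sc.textFile("hdfs:///data/large-input")
data.persist(StorageLevel.MEMORY_AND_DISK_SER)
```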

Can't determine cause of spark driver crash

2015-10-03 Thread adamsky
I am running a job where I am consistently causing the spark driver to crash, and am unable to diagnose the cause. I am running on Databricks, but I am posting my question here in case there may be something that I am doing which is clearly a problematic operation in spark. I am trying to do a

preferredNodeLocationData, SPARK-8949, and SparkContext - a leftover?

2015-10-03 Thread Jacek Laskowski
Hi, I've been reviewing SparkContext and found preferredNodeLocationData, which was made obsolete by SPARK-8949 [1]. When you search for where SparkContext.preferredNodeLocationData is used, you find 3 places - one constructor marked @deprecated, the other with a logWarning telling us that "Passing in

Re: WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,...

2015-10-03 Thread Ted Yu
Did you use spark-shell? In spark-shell, there can only be one running SparkContext, which is created automatically. Cheers On Sat, Oct 3, 2015 at 11:27 AM, Jacek Laskowski wrote: > Hi, > > The following WARN happens in Spark built from today's sources. There > were some

Re: How to make sense of Spark log entries

2015-10-03 Thread Ted Yu
Every commonly seen error has been discussed multiple times. Meaning, you can find related discussions / JIRAs using indexing services, such as: http://search-hadoop.com/ Here is one related talk: http://www.slideshare.net/Hadoop_Summit/why-your-spark-job-is-failing FYI On Sat, Oct 3, 2015 at

Re: WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,...

2015-10-03 Thread Jacek Laskowski
Hi Ted, You're absolutely right - twice! It was indeed initially in spark-shell, and there can be only one active SparkContext. I did `sc.stop`, but it didn't help much. That's why I switched to a Scala/sbt/Spark project and ran it all from scratch (as shown). The stack trace was showing up

Re: Contribution in Apache Spark

2015-10-03 Thread Chintan Bhatt
While typing the following line into the Hortonworks terminal, I'm getting *Syntax Error: invalid syntax*: myLines_filtered = myLines.filter(lambda x: len(x) > 0) On Wed, Sep 9, 2015 at 12:56 PM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > Thanks Akhil > Can you provide me any industry