Re: Aggregating over sorted data

2016-12-21 Thread Koert Kuipers
It can also be done with repartition + sortWithinPartitions + mapPartitions. Perhaps not as convenient, but it does not rely on undocumented behavior. I used this approach in spark-sorted. See here:
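A minimal sketch of that pattern (not taken from spark-sorted itself; the session, input path, and the `key`/`ts` column names are assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("sorted-agg").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/path/to/events")   // hypothetical input

// 1) repartition by key so all rows of a key land in one partition
// 2) sortWithinPartitions so each key's rows arrive as a contiguous run in ts order
// 3) mapPartitions walks each run and aggregates, relying only on the explicit sort
val perKey = df
  .repartition(col("key"))
  .sortWithinPartitions(col("key"), col("ts"))
  .mapPartitions { rows =>
    new Iterator[(String, Long)] {
      private val buffered = rows.buffered
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, Long) = {
        val key = buffered.head.getAs[String]("key")
        var count = 0L
        while (buffered.hasNext && buffered.head.getAs[String]("key") == key) {
          buffered.next(); count += 1   // replace with a real order-sensitive aggregation
        }
        (key, count)
      }
    }
  }

perKey.show()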

Re: Aggregating over sorted data

2016-12-21 Thread Liang-Chi Hsieh
I agree that to make sure this works, you might need to know the Spark internal implementation of APIs such as `groupBy`. But without any further changes to the current Spark implementation, I think this is one possible way to achieve the required functionality of aggregating on sorted data per key.
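For reference, a minimal sketch of the kind of combination being discussed, with hypothetical `key`, `ts`, and `value` columns on a DataFrame `df`; note that, as pointed out elsewhere in this thread, the API does not document that `groupBy`/`agg` preserve this ordering:

import org.apache.spark.sql.functions.{col, collect_list}

val agg = df
  .repartition(col("key"))
  .sortWithinPartitions(col("key"), col("ts"))
  .groupBy(col("key"))
  .agg(collect_list(col("value")).as("values"))   // order only holds if groupBy/agg preserve it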

Got an Accumulator error after adding some task metrics in Spark 2.0.2

2016-12-21 Thread ShaneTian
Hello all, I've tried to add some task metrics in org.apache.spark.executor.ShuffleReadMetrics.scala in Spark 2.0.2, following the format of the other existing metrics, but when submitting applications I got these errors: ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.

Re: Aggregating over sorted data

2016-12-21 Thread Koert Kuipers
I think this works, but it relies on groupBy and agg respecting the sorting. The API provides no such guarantee, so this could break in future versions. I would not rely on this, I think...

Re: Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-21 Thread Liang-Chi Hsieh
Your sample code first selects distinct zip codes, and then saves the rows of each distinct zip code into a Parquet file. So I think you can simply partition your data by using the `DataFrameWriter.partitionBy` API, e.g., df.repartition("zip_code").write.partitionBy("zip_code").parquet(.)
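Spelled out a little more, the suggestion looks roughly like this (input and output paths are hypothetical; the column name comes from the sample code):

import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/path/to/input")     // hypothetical path

// One sub-directory per distinct zip_code (e.g. .../zip_code=94105/), written in a
// single pass, instead of issuing one spark.sql query per zip code on the driver.
df.repartition(col("zip_code"))
  .write
  .partitionBy("zip_code")
  .parquet("/path/to/output")                     // hypothetical path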

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
If you run the main driver and the other Spark jobs in client mode, you can make sure they (I mean all the drivers) are running on the same node. Of course all the drivers now consume resources on that same node. If you run the main driver in client mode, but run the other Spark jobs in cluster mode,

RE: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread David Hodeffi
I am not familiar with any problem with that. Anyway, if you run a Spark application you already have multiple jobs, so it makes sense that this is not a problem. Thanks, David.

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Sebastian Piu
Is there any reason you need a context in the application launching the jobs? You can use SparkLauncher in a normal app and just listen for state transitions.
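A minimal sketch of that idea, assuming a hypothetical child jar and main class: a plain JVM process uses SparkLauncher to start the application and a SparkAppHandle.Listener to react to state transitions, with no SparkContext of its own:

import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchChildJob {
  def main(args: Array[String]): Unit = {
    val done = new CountDownLatch(1)

    val handle = new SparkLauncher()
      .setAppResource("/path/to/child-job.jar")       // hypothetical jar
      .setMainClass("com.example.ChildJob")           // hypothetical class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit = {
          println(s"state: ${h.getState}")            // react to state transitions here
          if (h.getState.isFinal) done.countDown()
        }
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    done.await()
    println(s"final state: ${handle.getState}")
  }
}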

unsubscribe

2016-12-21 Thread Junyang Shen

Re: Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-21 Thread satyajit vegesna
Hi,

val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd

def comp(zipcode: String): Unit = {
  val zipval = "SELECT * FROM df WHERE zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode)
  val data = spark.sql(zipval)

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
Incremental load traditionally means generating HFiles and using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the data into HBase. For your use case, the producer needs to find rows where the flag is 0 or 1. After such rows are obtained, it is up to you how the result of
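As a rough illustration of the read side (not a definitive recipe; the table name "events" and the "cf:flag" column are assumptions), rows with flag 0 or 1 could be pulled into Spark via TableInputFormat and filtered:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object IncrementalScan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-incremental").getOrCreate()

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "events")   // hypothetical table name

    // Read (rowkey, Result) pairs from HBase; a serialized Scan with a
    // SingleColumnValueFilter could push the flag check to the region servers instead.
    val rdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Keep only rows whose cf:flag column is "0" or "1" (pending rows).
    val pending = rdd.filter { case (_, result) =>
      val flag = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("flag"))
      flag != null && (Bytes.toString(flag) == "0" || Bytes.toString(flag) == "1")
    }

    println(s"pending rows: ${pending.count()}")
  }
}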

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
OK, sure, will ask. But what would be a generic best-practice solution for incremental load from HBase?

unsubscribe

2016-12-21 Thread Mudit Kumar

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
I haven't used Gobblin. You can consider asking the Gobblin mailing list about the first option. The second option would work.

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Thanks Liang! I get your point. It would mean that when launching Spark jobs, the mode needs to be specified as client for all Spark jobs. However, my concern is whether the driver's memory (of the driver launching the Spark jobs) will be used completely by the Futures (SparkContexts) or these spawned

unsubscribe

2016-12-21 Thread Al Pivonka

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Hi Sebastian, Yes, for fetching the details from Hive and HBase, I would want to use Spark's HiveContext etc. However, based on your point, I might have to check whether a JDBC-based driver connection could be used to do the same. The main reason for this is to avoid a client-server architecture design.

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
OK. I think it is a slightly unusual usage pattern, but it should work. As I said before, if you want those Spark applications to share cluster resources, proper configs are needed for Spark. If you submit the main driver and all other Spark applications in client mode under YARN, you should make

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Hi Team, Thanks for your responses. Let me give more details on how I am trying to launch jobs. The main Spark job will launch other Spark jobs, similar to calling multiple spark-submits within a Spark driver program. These spawned threads for the new jobs will be totally different components,

Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
Hello guys, I would like to understand the different approaches for distributed incremental load from HBase. Is there any *tool / incubator tool* which satisfies the requirement? *Approach 1:* Write a Kafka producer, manually maintain a column flag for events, and ingest it with LinkedIn Gobblin to HDFS /

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
Hi, As you launch multiple Spark jobs through `SparkLauncher`, I think it actually works like running multiple Spark applications with `spark-submit`. By default each application will try to use all available nodes. If your purpose is to share cluster resources across those Spark
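If the goal is sharing the cluster, one option is to cap each launched application explicitly; a minimal sketch of passing such configs through SparkLauncher (jar, class, and values are illustrative assumptions, not recommendations):

import org.apache.spark.launcher.SparkLauncher

// Cap what each child application may use so several can run side by side.
val handle = new SparkLauncher()
  .setAppResource("/path/to/child-job.jar")         // hypothetical jar
  .setMainClass("com.example.ChildJob")              // hypothetical class
  .setMaster("yarn")
  .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")      // illustrative values
  .setConf(SparkLauncher.EXECUTOR_CORES, "2")
  .setConf("spark.executor.instances", "2")
  .startApplication()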