Re: Invalidating/Remove complete mapWithState state

2017-04-17 Thread Matthias Niehoff
That would be the hard way, but if possible I want to clear the cache without stopping the application, maybe triggered by a message in the stream. On 17 April 2017 at 19:41, ayan guha wrote: > It sounds like you want to stop the stream process, wipe out the checkpoint

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Jayant Shekhar
Hello Gaurav, Pre-calculating the results would obviously be a great idea - and loading the results into a serving store from where you serve them out to your customers - as suggested by Jörn. And run it every hour/day, depending on your requirements. Zeppelin (as mentioned by Ayan) would not be a
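
A minimal sketch of the pattern Jayant and Jörn describe, aggregate on a schedule and push the result into a serving store; the source table, aggregation, JDBC URL and credentials below are placeholders, not taken from the thread:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("hourly-report").enableHiveSupport().getOrCreate()

    // precalculate the report once per run (schedule the job hourly/daily externally)
    val report = spark.table("events")                     // placeholder source table
      .groupBy("customer_id", "product_id")
      .count()

    val props = new Properties()
    props.setProperty("user", "report_user")               // placeholder credentials
    props.setProperty("password", "secret")

    // overwrite the serving table so the website always reads the latest precalculated result
    report.write.mode("overwrite")
      .jdbc("jdbc:postgresql://serving-db:5432/reports", "purchase_report", props)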

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Amol Patil
@Ayan - I am creating the temp table dynamically based on the dataset name. I will explore the df.saveAsTable option. On Mon, Apr 17, 2017 at 9:53 PM, Ryan wrote: > It shouldn't be a problem then. We've done a similar thing in Scala. I > don't have much experience with Python thread
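
For reference, a minimal sketch of the df.saveAsTable route being discussed, written in Scala (PySpark's DataFrameWriter exposes the same methods); the database and table names are placeholders:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def ingest(df: DataFrame, datasetName: String): Unit = {
      df.write
        .mode(SaveMode.Append)                 // append into the dataset's own Hive table
        .saveAsTable(s"mydb.$datasetName")     // no shared temp table, so no name collisions
    }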

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
It shouldn't be a problem then. We've done a similar thing in Scala. I don't have much experience with Python threads, but maybe the code related to reading/writing the temp table isn't thread-safe. On Mon, Apr 17, 2017 at 9:45 PM, Amol Patil wrote: > Thanks Ryan, > > Each

Application not found in RM

2017-04-17 Thread Mohammad Tariq
Dear fellow Spark users, *Use case :* I have written a small java client which launches multiple Spark jobs through *SparkLauncher* and captures jobs' metrics during the course of the execution. *Issue :* Sometimes the client fails saying - *Caused by:
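
For context, a rough sketch of the kind of SparkLauncher client being described; the jar path, main class and master are placeholders, and error handling is omitted:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/job.jar")            // placeholder jar
      .setMainClass("com.example.MyJob")             // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state=${h.getState} appId=${h.getAppId}")   // getAppId may still be null early on
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })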

Re: Is there a way to tell if a receiver is a Reliable Receiver?

2017-04-17 Thread Charles O. Bajomo
The easiest way I found was to take a look at the source. Any receiver that calls the version of store that requires an iterator is considered reliable. A definitive list would be nice. Kind Regards - Original Message - From: "Justin Pihony" To: "user"
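
A minimal sketch of the distinction Charles points at: the multi-record store(...) variants block until Spark has stored the block, which is what makes a receiver reliable. The source-fetching call here is a placeholder:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MyReliableReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      override def onStart(): Unit = {
        new Thread("receiver-loop") {
          override def run(): Unit = {
            while (!isStopped()) {
              val batch = fetchNextBatch()     // placeholder call to the actual source
              store(batch.iterator)            // reliable: blocks until the block is stored
              // acknowledge the source here; store(singleRecord) would not give that guarantee
            }
          }
        }.start()
      }
      override def onStop(): Unit = ()
      private def fetchNextBatch(): Seq[String] = Seq("a", "b", "c")   // placeholder data
    }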

Is there a way to tell if a receiver is a Reliable Receiver?

2017-04-17 Thread Justin Pihony
I can't seem to find anywhere that would let a user know if the receiver they are using is reliable or not. Even better would be a list of known reliable receivers. Are any of these things possible? Or do you just have to research your receiver beforehand?

Re: isin query

2017-04-17 Thread Koert Kuipers
I don't see this behavior in the current spark master:

scala> val df = Seq("m_123", "m_111", "m_145", "m_098", "m_666").toDF("msrid")
df: org.apache.spark.sql.DataFrame = [msrid: string]

scala> df.filter($"msrid".isin("m_123")).count
res0: Long = 1

scala>

Handling skewed data

2017-04-17 Thread Vishnu Viswanath
Hello All, Does anyone know if the skew handling code mentioned in this talk https://www.youtube.com/watch?v=bhYV0JOPd9Y was added to spark? If so can I know where to look for more info, JIRA? Pull request? Thanks in advance. Regards, Vishnu Viswanath.

Fwd: isin query

2017-04-17 Thread nayan sharma
Thanks for responding. df.filter($"msrid" === "m_123" || $"msrid" === "m_111") is one of many workarounds to my question, but can you let me know what's wrong with the "isin" query? Regards, Nayan > Begin forwarded message: > > From: ayan guha > Subject: Re: isin query > Date:

Re: isin query

2017-04-17 Thread ayan guha
How about using the OR operator in the filter? On Tue, 18 Apr 2017 at 12:35 am, nayan sharma wrote: > Dataframe (df) having column msrid(String) having values > m_123,m_111,m_145,m_098,m_666 > > I wanted to filter out rows which are having values m_123,m_111,m_145 > >
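
Putting both forms from this thread side by side as a spark-shell sanity check; if isin over the full list really returns 0 while single-value isin matches, it is also worth checking the stored values for stray whitespace or different casing (an assumption, not something confirmed in the thread):

    import spark.implicits._

    val ids = Seq("m_123", "m_111", "m_145")

    df.filter($"msrid".isin(ids: _*)).count()

    df.filter($"msrid" === "m_123" || $"msrid" === "m_111" || $"msrid" === "m_145").count()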

filter operation using isin

2017-04-17 Thread nayan sharma
I have a DataFrame (df) with a column msrid (String) having the values m_123, m_111, m_145, m_098, m_666. I want to filter the rows having the values m_123, m_111, m_145. df.filter($"msrid".isin("m_123","m_111","m_145")).count gives count = 0, while df.filter($"msrid".isin("m_123")).count gives count = 121212. I have

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
Yes, 5 MB is a difficult size: too small for HDFS, too big for Parquet/ORC. Maybe you can put the data in a HAR and store the id and path in ORC/Parquet. > On 17. Apr 2017, at 10:52, 莫涛 wrote: > > Hi Jörn, > > > > I do think a 5 MB column is odd but I don't have any other idea
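
A rough sketch of Jörn's HAR suggestion: archive the binaries once with the hadoop archive tool and keep only (id, path) in ORC; all paths and names below are illustrative:

    // one-off, outside Spark (the archive layout is illustrative):
    //   hadoop archive -archiveName videos.har -p /data videos /archived

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    import spark.implicits._

    val index = Seq(
      ("id_001", "har:///archived/videos.har/videos/id_001.bin"),
      ("id_002", "har:///archived/videos.har/videos/id_002.bin")
    ).toDF("id", "path")

    index.write.mode("overwrite").format("orc").saveAsTable("video_index")

    // filtering now touches only this small index; the matched binaries are then
    // read directly via their har:// paths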

isin query

2017-04-17 Thread nayan sharma
I have a DataFrame (df) with a column msrid (String) having the values m_123, m_111, m_145, m_098, m_666. I want to filter the rows having the values m_123, m_111, m_145. df.filter($"msrid".isin("m_123","m_111","m_145")).count gives count = 0, while df.filter($"msrid".isin("m_123")).count gives count = 121212. I have

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread ayan guha
One possibility is using Hive with bucketing on the id column. Another option: build the index in HBase, i.e. store the id and HDFS path in HBase. This way your scans will be fast, and once you have the HDFS path pointers you can read the actual data from HDFS. On Mon, 17 Apr 2017 at 6:52 pm, 莫涛
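
A sketch of the bucketing idea using Spark 2.0's own DataFrameWriter bucketing (note this is Spark bucketing, not Hive's); the bucket count, source path and table name are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    val records = spark.read.parquet("/data/records")    // placeholder (id, binary) source

    records.write
      .bucketBy(256, "id")          // co-locate equal ids in the same bucket file
      .sortBy("id")                 // sorted within buckets, as also suggested in this thread
      .format("parquet")
      .saveAsTable("records_bucketed")

    // a lookup by id then only needs to touch the buckets those ids hash to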

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread ayan guha
Zeppelin is more useful for interactive data exploration. If the reports are known beforehand then any good reporting tool should work, such as Tableau, Qlik, Power BI etc. Zeppelin is not a fit for this use case. On Mon, 17 Apr 2017 at 6:57 pm, Gaurav Pandya wrote: >

Re: Invalidating/Remove complete mapWithState state

2017-04-17 Thread ayan guha
It sounds like you want to stop the stream process, wipe out the checkpoint and restart? On Mon, 17 Apr 2017 at 10:13 pm, Matthias Niehoff < matthias.nieh...@codecentric.de> wrote: > Hi everybody, > > is there a way to completely invalidate or remove the state used by > mapWithState, not only for

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread ayan guha
What happens if you do not use the temp table, but directly do df.saveAsTable with mode append? If I have to guess without looking at the code of your task function, I would think the name of the temp table is evaluated statically, so all threads are referring to the same table. In other words your app is

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Amol Patil
Thanks Ryan, Each dataset has a separate Hive table. All Hive tables belong to the same Hive database. The idea is to ingest the data in parallel into the respective Hive tables. If I run the code sequentially for each data source, it works fine but it takes a lot of time. We are planning to process around 30-40

Re: how to add new column using regular expression within pyspark dataframe

2017-04-17 Thread Павел
On Mon, Apr 17, 2017 at 3:25 PM, Zeming Yu wrote:
> I've got a dataframe with a column looking like this:
>
> display(flight.select("duration").show())
>
> +--------+
> |duration|
> +--------+
> |  15h10m|
> |   17h0m|
> |  21h25m|
> |  14h25m|
> |  14h30m|
> +--------+
> only

Re: Memory problems with simple ETL in Pyspark

2017-04-17 Thread ayan guha
Good to know it worked. If some of the jobs still fail, that can indicate skew in your dataset. You may want to think of a partition-by function. Also, do you still see containers killed by YARN? If so, at what point? You should see something like your app is trying to use x GB while YARN can
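
One common way to act on the skew ayan mentions is salting the hot key before the wide operation; the DataFrame, column names and salt bucket count below are illustrative, not from the original job:

    import spark.implicits._
    import org.apache.spark.sql.functions.{concat_ws, floor, rand}

    val salted = df
      .withColumn("salt", floor(rand() * 32))                        // spread hot keys over 32 buckets
      .withColumn("salted_key", concat_ws("_", $"key", $"salt"))

    // aggregate on the salted key first, then combine the partial results per real key
    val partial = salted.groupBy("salted_key", "key").count()
    val result  = partial.groupBy("key").sum("count")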

how to add new column using regular expression within pyspark dataframe

2017-04-17 Thread Zeming Yu
I've got a dataframe with a column looking like this:

display(flight.select("duration").show())

+--------+
|duration|
+--------+
|  15h10m|
|   17h0m|
|  21h25m|
|  14h30m|
|  24h50m|
|  26h10m|
|  14h30m|
|   23h5m|
|  21h30m|
|  11h50m|
|  16h10m|
|  15h15m|
|  21h25m|
|  14h25m|
|  14h40m|
|
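
One way to do this, shown in Scala for consistency with the other snippets in this digest (PySpark has the same regexp_extract function in pyspark.sql.functions); the new column names are arbitrary and `flight` is the DataFrame from the question:

    import spark.implicits._
    import org.apache.spark.sql.functions.regexp_extract

    val withParts = flight
      .withColumn("duration_hours",   regexp_extract($"duration", "(\\d+)h", 1).cast("int"))
      .withColumn("duration_minutes", regexp_extract($"duration", "(\\d+)m", 1).cast("int"))

    withParts.select("duration", "duration_hours", "duration_minutes").show()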

Invalidating/Remove complete mapWithState state

2017-04-17 Thread Matthias Niehoff
Hi everybody, is there a way to completely invalidate or remove the state used by mapWithState, not only for a given key using State#remove()? Deleting the state key by key is not an option, as a) not all possible keys are known (this might be worked around, of course) and b) the number of keys is too big

Spark-shell's performance

2017-04-17 Thread Richard Hanson
I am playing with some data using a (standalone) spark-shell (Spark version 1.6.0) by executing `spark-shell`. The flow is simple; a bit like cp - basically moving 100k local files (the max size is 190k) to S3. Memory is configured as below: export SPARK_DRIVER_MEMORY=8192M export

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Gaurav Pandya
Thanks Jorn. Yes, I will precalculate the results. Do you think Zeppelin can work here? On Mon, Apr 17, 2017 at 1:41 PM, Jörn Franke wrote: > Processing through Spark is fine, but I do not recommend that each of the > users triggers a Spark query. So either you

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Jörn, I do think a 5 MB column is odd, but I didn't have any better idea before asking this question. The binary data is a short video and the maximum size is no more than 50 MB. Hadoop archive sounds very interesting and I'll try it first to check whether filtering is fast on it. To my

Re: Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
How about the event timeline on the executors? It seems adding more executors could help. 1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that states the predicate pushdown should work. And I think "only for matched ones the binary data is read" is true if a proper index is configured. The row group

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Ryan, The attachment is a screenshot of the Spark job and this is the only stage of the job. I've changed the partition size to 1 GB with "--conf spark.sql.files.maxPartitionBytes=1073741824". 1. spark-orc seems not that smart. The input size is almost the whole data. I guess "only for

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Jörn Franke
Processing through Spark is fine, but I do not recommend that each of the users triggers a Spark query. So either you precalculate the reports in Spark so that the reports themselves do not trigger Spark queries or you have a database that serves the report. For the latter case there are tons

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
You need to sort the data by id, otherwise a situation can occur where the index does not work. Aside from this, it sounds odd to store a 5 MB column using those formats. This will also not be so efficient. What is in the 5 MB binary data? You could use HAR or maybe HBase to store this kind of

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Gaurav Pandya
Thanks for the reply, Jörn. In my case, I am going to put the analysis on an e-commerce website, so naturally there will be more users and the number will keep growing as the e-commerce website captures the market. Users will not be doing any analysis here. Reports will show their purchasing behaviour and pattern (kind

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
1. Per my understanding, for ORC files it should push down the filters, which means the whole id column will be scanned but only for matched ones the binary data is read. I haven't dug into the spark-orc reader though. 2. ORC itself has a row group index and a bloom filter index. You may try configurations
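
A sketch of the knobs Ryan refers to, written as Spark SQL calls; the table and column names are placeholders and the exact effect depends on the ORC/Hive versions in the cluster:

    // let Spark push the id filter down into the ORC reader
    spark.conf.set("spark.sql.orc.filterPushdown", "true")

    // ask ORC to build row-group indexes and a bloom filter on the id column
    spark.sql("""
      CREATE TABLE records_orc (id STRING, data BINARY)
      STORED AS ORC
      TBLPROPERTIES (
        'orc.create.index' = 'true',
        'orc.bloom.filter.columns' = 'id',
        'orc.bloom.filter.fpp' = '0.05'
      )
    """)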

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Jörn Franke
I think it highly depends on your requirements. There are various tools for analyzing and visualizing data. How many concurrent users do you have? What analysis do they do? How much data is involved? Do they have to process the data all the time or can they live with sampling which increases

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
Hi Ryan, 1. "expected qps and response time for the filter request" I expect that only the requested BINARY are scanned instead of all records, so the response time would be "10K * 5MB / disk read speed", or several times of this. In practice, our cluster has 30 SAS disks and scanning all

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
I don't think you can insert into a Hive table in parallel without dynamic partitioning; for Hive locking please refer to https://cwiki.apache.org/confluence/display/Hive/Locking. Other than that, it should work. On Mon, Apr 17, 2017 at 6:52 AM, Amol Patil wrote: > Hi All, > >

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
You can build a search tree over the ids within each partition to act like an index, or create a bloom filter to see if the current partition would have any hit. What's your expected QPS and response time for the filter request? On Mon, Apr 17, 2017 at 2:23 PM, MoTao wrote: > Hi
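
A sketch of the bloom-filter idea using Spark's built-in sketch support (DataFrameStatFunctions.bloomFilter, available in Spark 2.0); the per-bucket layout, sizes and ids are assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.sketch.BloomFilter

    val spark = SparkSession.builder.getOrCreate()

    // offline: one bloom filter over the ids stored in each bucket directory
    val numBuckets = 256
    val filters: Map[Int, BloomFilter] = (0 until numBuckets).map { b =>
      val ids = spark.read.parquet(s"/data/records/bucket=$b").select("id")
      b -> ids.stat.bloomFilter("id", 100000L, 0.01)   // expected items, false-positive rate
    }.toMap

    // query time: only read buckets whose filter might contain one of the requested ids
    val wanted = Seq("id_123", "id_456")
    val candidateBuckets = filters.collect {
      case (b, bf) if wanted.exists(id => bf.mightContain(id)) => b
    }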

How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread MoTao
Hi all, I have 10M (ID, BINARY) records, and the size of each BINARY is 5 MB on average. In my daily application, I need to filter out 10K BINARY records according to an ID list. How should I store the whole dataset to make the filtering faster? I'm using DataFrames in Spark 2.0.0 and I've tried row-based