Re: native-lzo library not available

2018-05-03 Thread ayan guha
This seems to be a Cloudera environment issue, and you might get a faster and more reliable answer in the Cloudera forums. On Fri, May 4, 2018 at 3:39 PM, Fawze Abujaber wrote: > Hi Yulia, > > Thanks for your response. > > I see lzo only for Impala > > [root@xxx ~]#

Re: native-lzo library not available

2018-05-03 Thread Fawze Abujaber
Hi Yulia, Thanks for your response. I see lzo only for Impala:

[root@xxx ~]# locate *lzo*.so*
/opt/cloudera/parcels/GPLEXTRAS-5.13.0-1.cdh5.13.0.p0.29/lib/impala/lib/libimpalalzo.so
/usr/lib64/liblzo2.so.2
/usr/lib64/liblzo2.so.2.0.0

the

Re: Read or save specific blocks of a file

2018-05-03 Thread ayan guha
Is this a recommended way of reading data in the long run? I think it might be better to write or look for an InputFormat which supports the need; see the sketch below. By the way, a block is designed to be an HDFS-internal representation that enables certain features. It would be interesting to understand the use case where the client app
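As a starting point for that route, here is a minimal sketch of how any Hadoop InputFormat plugs into Spark, using the stock TextInputFormat; a custom InputFormat whose getSplits() returned only the wanted block ranges would slot in the same way (the path is illustrative):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("custom-input-format"))

// Swap TextInputFormat for a custom InputFormat whose getSplits() returns
// only the blocks/ranges of interest.
val rdd = sc.newAPIHadoopFile(
  "hdfs:///user/me/Test.csv",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])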

Re: native-lzo library not available

2018-05-03 Thread yuliya Feldman
The jar is not enough; you need the native library (*.so). See if your "native" directory contains it:

drwxr-xr-x 2 cloudera-scm cloudera-scm 4096 Oct 4 2017 native

and whether java.library.path or LD_LIBRARY_PATH points to or includes the directory where your *.so library resides. On Thursday, May 3,
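If the library does exist under the parcel, one way to point the JVMs at it from job code is via Spark's library-path settings; a minimal sketch for Spark 1.6 (the parcel path below is an assumption; substitute the directory you actually found):

import org.apache.spark.{SparkConf, SparkContext}

// The parcel path is illustrative; use the directory that actually holds
// the native lzo .so files on your cluster.
val conf = new SparkConf()
  .setAppName("lzo-job")
  .set("spark.executor.extraLibraryPath",
       "/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native")
  .set("spark.driver.extraLibraryPath",
       "/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native")
val sc = new SparkContext(conf)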

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-03 Thread Ryan Blue
Yes, you can usually use a broadcast join to avoid skew problems. On Wed, May 2, 2018 at 8:57 PM, Pralabh Kumar wrote: > I am performing a join operation; if I convert the reduce-side join to a map-side > join (no shuffle will happen), I assume that in that case this error shouldn't
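A minimal sketch of that suggestion (paths and the join key are illustrative): broadcasting the small side turns the shuffle join into a map-side join, so a skewed key can no longer pile its rows onto a single reducer or produce an oversized shuffle frame.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

val large = spark.read.parquet("/data/large") // skewed on "key"
val small = spark.read.parquet("/data/small") // small enough to broadcast

// broadcast() ships the small table to every executor, so the large side
// is joined in place with no shuffle of its rows.
val joined = large.join(broadcast(small), Seq("key"))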

[Structured streaming, V2] commit on ContinuousReader

2018-05-03 Thread Jiří Syrový
Version: 2.3, DataSourceV2, ContinuousReader. Hi, we're creating a new data source to fetch data from a streaming source that requires committing received data, and we would like to commit data once in a while, after it has been retrieved and correctly processed, and then fetch more. One option could

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread Arun Mahadevan
I think you need to group by a window (tumbling) and define watermarks (put a very low watermark or even 0) to discard the state. Here the window duration becomes your logical batch. - Arun From: kant kodali Date: Thursday, May 3, 2018 at 1:52 AM To: "user @spark"
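A minimal sketch of that pattern, using the built-in rate source as a stand-in for the real stream (source and durations are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, window}

val spark = SparkSession.builder().appName("windowed-agg").getOrCreate()

// The rate source emits (timestamp, value) rows; any stream with an
// event-time column works the same way.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// A 0-second watermark lets Spark drop a window's state as soon as the
// window closes, so each tumbling window acts as one logical batch.
val perWindow = events
  .withWatermark("timestamp", "0 seconds")
  .groupBy(window(col("timestamp"), "1 minute"))
  .agg(collect_list(col("value")))

perWindow.writeStream.outputMode("append").format("console").start().awaitTermination()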

Re: Read or save specific blocks of a file

2018-05-03 Thread Thodoris Zois
Hello Madhav, What I did is pretty straightforward. Let's say that your HDFS block size is 128 MB and you store a 256 MB file named Test.csv in HDFS. First use the command: `hdfs fsck Test.csv -locations -blocks -files`. It will return some very useful information, including the list of
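For reference, the same block layout that fsck prints can also be fetched programmatically; a minimal sketch using the Hadoop client API (the path is illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/user/me/Test.csv"))

// One BlockLocation per HDFS block: its byte offset, its length, and the
// datanodes holding a replica.
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach { b =>
  println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
}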

Pickling Keras models for use in UDFs

2018-05-03 Thread erp12
I would like to create a Spark UDF which returns a prediction made with a trained Keras model. Keras models are not typically pickle-able; however, I have used the monkey-patch approach to making Keras models pickle-able, as described here:

Re: AccumulatorV2 vs AccumulableParam (V1)

2018-05-03 Thread Wenchen Fan
Hi Sergey, Thanks for your valuable feedback!

For 1: yea, this is definitely a bug and I have sent a PR to fix it.
For 2: I have left my comments on the JIRA ticket.
For 3: I don't quite understand it, can you give some concrete examples?
For 4: yea, this is a problem, but I think it's not a big
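For readers following the thread, here is a minimal sketch of the AccumulatorV2 API under discussion; the accumulator itself is an illustrative example, not something from the thread:

import org.apache.spark.util.AccumulatorV2

// A custom accumulator that collects distinct strings across tasks.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private var set: Set[String] = Set.empty

  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.set = set
    acc
  }
  override def reset(): Unit = set = Set.empty
  override def add(v: String): Unit = set += v
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit =
    set ++= other.value
  override def value: Set[String] = set
}

// Register it before use so Spark can merge per-task copies on the driver:
// sc.register(new StringSetAccumulator, "distinctStrings")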

native-lzo library not available

2018-05-03 Thread Fawze Abujaber
Hi Guys, I'm running into an issue where my Spark jobs are failing with the below error. I'm using Spark 1.6.0 with CDH 5.13.0. I tried to figure it out with no success and will appreciate any help or a direction on how to attack this issue. User class threw exception: org.apache.spark.SparkException:

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread kant kodali
After doing some more research using Google, it's clear that aggregations are stateful by default in Structured Streaming. So the question now is: how to do stateless aggregations (not storing the result from previous batches) using Structured Streaming 2.3.0? I am trying to do it using raw Spark

question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread kant kodali
Hi All, I was under the assumption that one needs to run groupBy(window(...)) to run any stateful operations, but it looks like that is not the case, since any aggregation query like "select count(*) from some_view" is also stateful, as it stores the result of the count from the previous batch.
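A minimal sketch illustrating that behavior (the socket source and view name are illustrative): a plain streaming count with no window keeps one running total across micro-batches.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stateful-count").getOrCreate()

// Any unbounded aggregation over a stream is stateful: Spark carries the
// running count forward from batch to batch.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
lines.createOrReplaceTempView("some_view")

spark.sql("select count(*) from some_view")
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()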