Error in java_gateway.py

2018-08-08 Thread shubham
Following the code snippets on this thread, I got a working version of XGBoost on PySpark. But one issue I am still facing is the following: File

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Koert Kuipers
a new map task after a shuffle is also a narrow dependency, isn't it? it's narrow because data doesn't need to move, e.g. every partition depends on a single partition, preferably on the same machine. modifying a previous shuffle to avoid a shuffle strikes me as odd, and can potentially make a mess of

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Jungtaek Lim
> shouldn't coalesce introduce a new map phase with fewer tasks instead of changing the previous shuffle? The javadoc of Dataset.coalesce [1] describes this behavior clearly. It results in a narrow dependency, hence no shuffle. So it is pretty clear that you need to use "repartition". Not sure
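
A minimal Scala sketch of the contrast being described, assuming a hypothetical input path, grouping column, and target of 100 partitions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

// Hypothetical input and grouping column, standing in for the job in this thread.
val counts = spark.read.parquet("/path/to/input").groupBy("someColumn").count()

// coalesce(100) is the narrow dependency described above: it is folded back into
// the aggregation stage, so the groupBy itself ends up running with only 100 tasks.
counts.coalesce(100).write.parquet("/path/to/out-coalesce")

// repartition(100) adds an extra shuffle stage after the aggregation, leaving the
// groupBy's parallelism (spark.sql.shuffle.partitions) untouched.
counts.repartition(100).write.parquet("/path/to/out-repartition")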

unsubscribe

2018-08-08 Thread Naver
> On 8 Aug 2018, at 17:35, Daniel Zhang wrote: > > Hi, > > I have one question related to running unit tests in IntelliJ. > > I import Spark into IntelliJ as a Maven project, and have no issue building > the whole project. > > While for some unit tests, I see they can be run directly in

[Structured Streaming] Understanding watermark, flatMapGroupsWithState and possibly windowing

2018-08-08 Thread subramgr
Hi, We have a use case where we need to *sessionize* our data and for each *session* emit some *metrics*. We need to handle *repeated sessions* and *session timeouts*. We have come up with the following code structure and would like to understand if we understand all the concepts of *watermark*,
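
The code structure itself is not shown in the archive; a minimal Scala sketch of the shape such a job could take, using flatMapGroupsWithState with an event-time timeout (the event/metric types, the rate source, the session-id derivation, and the 10-minute watermark and 30-minute timeout are all hypothetical):

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical event and metric types; the real schema comes from the actual source.
case class Event(sessionId: String, timestamp: Timestamp)
case class SessionMetrics(sessionId: String, eventCount: Long)

val spark = SparkSession.builder().appName("sessionize-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stream; in the real job this would be Kafka, files, etc.
val events: Dataset[Event] = spark.readStream
  .format("rate").option("rowsPerSecond", "10").load()
  .select($"timestamp", ($"value" % 100).cast("string").as("sessionId"))
  .as[Event]

val sessions = events
  .withWatermark("timestamp", "10 minutes")  // how much event-time lateness to tolerate
  .groupByKey(_.sessionId)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.EventTimeTimeout) {
    (sessionId: String, batch: Iterator[Event], state: GroupState[Long]) =>
      if (state.hasTimedOut) {
        // Session timed out: emit its metrics once and drop the state.
        val out = Iterator(SessionMetrics(sessionId, state.get))
        state.remove()
        out
      } else {
        // New or repeated session: keep accumulating the running event count.
        state.update(state.getOption.getOrElse(0L) + batch.size)
        // Close the session 30 minutes of event time after the current watermark.
        state.setTimeoutTimestamp(state.getCurrentWatermarkMs + 30 * 60 * 1000)
        Iterator.empty
      }
  }

sessions.writeStream.outputMode("append").format("console").start().awaitTermination()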

Re: Run/install tensorframes on zeppelin pyspark

2018-08-08 Thread Jeff Zhang
Make sure you use the correct python which has tensorframes installed. Use PYSPARK_PYTHON to configure the python. Spico Florin wrote on Wed, Aug 8, 2018 at 9:59 PM: > Hi! > > I would like to use tensorframes in my pyspark notebook. > > I have performed the following: > > 1. In the spark interpreter added a new
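
Jeff's suggestion, spelled out as a sketch: export PYSPARK_PYTHON in conf/zeppelin-env.sh (the path to the tensorframes-enabled Python below is hypothetical) and restart the Spark interpreter:

export PYSPARK_PYTHON=/opt/envs/tensorframes/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/envs/tensorframes/bin/python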

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Koert Kuipers
sorry i meant to say: with a checkpoint i get a map phase with lots of tasks to read the data, then a reduce phase with 2048 reducers, and then finally a map phase with 100 tasks. On Wed, Aug 8, 2018 at 4:54 PM, Koert Kuipers wrote: > the only thing that seems to stop this so far is a
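
A sketch of the checkpoint workaround described here (input path, grouping column, and checkpoint directory are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-then-coalesce").getOrCreate()
spark.sparkContext.setCheckpointDir("/path/to/checkpoints")  // hypothetical checkpoint dir

// The eager checkpoint materializes the 2048-reducer aggregation and cuts the lineage,
// so the coalesce(100) that follows becomes its own small map phase instead of
// shrinking the shuffle itself.
val counts = spark.read.parquet("/path/to/input")
  .groupBy("someColumn").count()
  .checkpoint()

counts.coalesce(100).write.parquet("/path/to/output")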

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Koert Kuipers
the only thing that seems to stop this so far is a checkpoint. with a checkpoint i get a map phase with lots of tasks to read the data, then a reduce phase with 2048 reducers, and then finally a map phase with 4 tasks. now i need to figure out how to do this without having to checkpoint. i wish i

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Koert Kuipers
ok thanks. hmm, that seems odd. shouldn't coalesce introduce a new map phase with fewer tasks instead of changing the previous shuffle? using repartition seems too expensive just to keep the number of files down. so i guess i am back to looking for another solution. On Wed, Aug 8, 2018 at

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Vadim Semenov
`coalesce` sets the number of partitions for the last stage, so you have to use `repartition` instead, which is going to introduce an extra shuffle stage. On Wed, Aug 8, 2018 at 3:47 PM Koert Kuipers wrote: > > one small correction: lots of files lead to pressure on the spark driver > program
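
One way to see the difference Vadim describes is in the physical plan; a sketch, assuming df is the DataFrame read from the input files and a hypothetical grouping column:

// repartition shows up as an extra Exchange (RoundRobinPartitioning) after the
// aggregation's own Exchange; coalesce shows up as a Coalesce node on the same stage.
df.groupBy("someColumn").count().repartition(100).explain()
df.groupBy("someColumn").count().coalesce(100).explain()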

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Koert Kuipers
one small correction: lots of files lead to pressure on the spark driver program when reading this data in spark. On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers wrote: > hi, > > i am reading data from files into a dataframe, then doing a groupBy for a > given column with a count, and finally i

groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Koert Kuipers
hi, i am reading data from files into a dataframe, then doing a groupBy for a given column with a count, and finally i coalesce to a smaller number of partitions before writing out to disk. so roughly:
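
The snippet is cut off in the archive; a minimal sketch of the pipeline as described, with hypothetical paths, column name, and partition count:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-then-coalesce").getOrCreate()

spark.read.parquet("/path/to/input")       // read data from files into a dataframe
  .groupBy("someColumn").count()           // runs with spark.sql.shuffle.partitions tasks
  .coalesce(100)                           // intended only to limit the number of output files
  .write.parquet("/path/to/output")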

Data source jdbc does not support streamed reading

2018-08-08 Thread James Starks
Now my spark job can perform sql operations against a database table. Next I want to combine that with a streaming context, so I am switching to the readStream() function. But after job submission, spark throws Exception in thread "main" java.lang.UnsupportedOperationException: Data source jdbc does
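
A sketch of the switch that triggers this (connection details are hypothetical): the jdbc source only implements batch reads, so the readStream variant fails at load():

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-stream-sketch").getOrCreate()

// Batch read against the table works:
val batchDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")  // hypothetical connection
  .option("dbtable", "events")
  .load()

// The streamed variant does not: the jdbc source has no streaming support, so this
// throws java.lang.UnsupportedOperationException: Data source jdbc does not support streamed reading.
val streamDf = spark.readStream.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "events")
  .load()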

Run/install tensorframes on zeppelin pyspark

2018-08-08 Thread Spico Florin
Hi! I would like to use tensorframes in my pyspark notebook. I have performed the following: 1. In the spark interpreter added a new repository http://dl.bintray.com/spark-packages/maven 2. In the spark interpreter added the dependency databricks:tensorframes:0.2.9-s_2.11 3. pip install

Re: Replacing groupBykey() with reduceByKey()

2018-08-08 Thread Biplob Biswas
Hi Santhosh, My name is not Bipin, it's Biplob, as is clear from my signature. Regarding your question, I have no clue what your map operation is doing on the grouped data, so I can only suggest you do: dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
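
The thread is PySpark, but the groupByKey-vs-reduceByKey trade-off is the same in every RDD API; a small Scala sketch with hypothetical (key, value) pairs:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reducebykey-sketch").getOrCreate()

// Hypothetical pairs; in the thread they would come from the rows of the ORC file.
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey ships every value across the network before anything is combined.
val grouped = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values on the map side first, so far less data is shuffled
// for the same per-key sums.
val reduced = pairs.reduceByKey(_ + _)

reduced.collect().foreach(println)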

Re: need workaround around HIVE-11625 / DISTRO-800

2018-08-08 Thread Pranav Agrawal
any help please? On Tue, Aug 7, 2018 at 1:49 PM, Pranav Agrawal wrote: > I am hitting issue > https://issues.cloudera.org/browse/DISTRO-800 (related to > https://issues.apache.org/jira/browse/HIVE-11625) > > I am unable to write an empty array of type int or string (array of size 0) > into

Re: Split a row into multiple rows Java

2018-08-08 Thread Manu Zhang
The following may help, although it is in Scala. The idea is to first concat each value with its time, assemble all the time_value pairs into an array and explode, and finally split time_value back into time and value. val ndf = df.select(col("name"), col("otherName"), explode( array(concat_ws(":", col("v1"),
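
The snippet is cut off in the archive; a hedged sketch of the full pattern, with hypothetical input columns v1/t1 and v2/t2 (the ordering inside each pair follows the visible concat_ws call, so item 0 is the value and item 1 is the time):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("split-rows-sketch").getOrCreate()
import spark.implicits._

// Hypothetical wide input: one row per name carrying two value/time pairs.
val df = Seq(("alice", "x", 1, 1533686400L, 2, 1533690000L))
  .toDF("name", "otherName", "v1", "t1", "v2", "t2")

val ndf = df.select(
  col("name"), col("otherName"),
  // Pair each value with its time, collect the pairs into an array,
  // and explode so that each pair becomes its own row.
  explode(array(
    concat_ws(":", col("v1"), col("t1")),
    concat_ws(":", col("v2"), col("t2"))
  )).as("time_value"))

// Split the combined string back into separate value and time columns.
val result = ndf
  .withColumn("value", split(col("time_value"), ":").getItem(0))
  .withColumn("time", split(col("time_value"), ":").getItem(1))
  .drop("time_value")

result.show()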