Weird experience with Hive and Spark Transformations

2017-01-16 Thread Chetan Khatri
Hello, I have the following services configured and installed successfully: Hadoop 2.7.x, Spark 2.0.x, HBase 1.2.4, Hive 1.2.1. *Installation Directories:* /usr/local/hadoop /usr/local/spark /usr/local/hbase *Hive Environment variables:* #HIVE VARIABLES START export HIVE_HOME=/usr/local/hive
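A minimal sketch of how a Spark 2.0.x session can be pointed at this Hive installation, assuming hive-site.xml from $HIVE_HOME/conf has been copied into $SPARK_HOME/conf (the application name below is a placeholder):

    import org.apache.spark.sql.SparkSession

    // Enable Hive support so the session uses the metastore configured
    // in hive-site.xml (assumed to be on Spark's classpath).
    val spark = SparkSession.builder()
      .appName("HiveIntegrationSketch")   // placeholder name
      .enableHiveSupport()
      .getOrCreate()

    // Quick check that the Hive metastore is reachable.
    spark.sql("SHOW DATABASES").show()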

Re: About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Spark folks, another weird experience I have with Spark and SQLContext: when I create a DataFrame, sometimes this throws an exception and sometimes it does not! scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> val stdDf =
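The stdDf definition is truncated in the archive; purely to illustrate the pattern in play, a DataFrame is usually built in the shell roughly like this (the Student case class and sample rows are hypothetical):

    // Hypothetical case class and rows, only to show the toDF pattern.
    case class Student(id: Int, name: String)

    import sqlContext.implicits._

    val stdDf = sc.parallelize(Seq(Student(1, "a"), Student(2, "b"))).toDF()
    stdDf.show()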

Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Pradeep Gollakota
Usually this kind of thing can be done at a lower level in the InputFormat, by specifying the max split size. Have you looked into that possibility with your InputFormat? On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote: > Hi Jasbir, > > Yes, you are right. Do you have
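A rough sketch of that approach with the new MapReduce API, assuming a FileInputFormat-based input format; the 64 MB cap and the input path are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Cap the split size so each block yields more, smaller splits.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize",
      (64L * 1024 * 1024).toString)

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///path/to/input",          // placeholder path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      hadoopConf)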

Re: Spark support on Windows

2017-01-16 Thread Steve Loughran
On 16 Jan 2017, at 11:06, Hyukjin Kwon wrote: Hi, I just looked through Jacek's page and I believe that is the correct way. That seems to be a Hadoop-library-specific issue [1]. To my knowledge, winutils and the binaries in the private repo

Re: Both Spark AM and Client are trying to delete the Staging Directory

2017-01-16 Thread Steve Loughran
On 16 Jan 2017, at 12:51, Rostyslav Sotnychenko wrote: Thanks all! I was using another DFS instead of HDFS, which was logging an error when fs.delete got called on a non-existing path. Really? Whose DFS, if you don't mind me asking?

Re: Spark support on Windows

2017-01-16 Thread Steve Loughran
On 16 Jan 2017, at 10:35, assaf.mendelson wrote: Hi, the documentation says Spark is supported on Windows. The problem, however, is that the documentation for Windows is lacking. There are sources (such as

Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Fei Hu
Hi Pradeep, That is a good idea. My customized RDDs are similar to the NewHadoopRDD. If we have billions of InputSplits, will that become a performance bottleneck? That is, will too much data need to be transferred from the master node to the computing nodes over the network? Thanks, Fei On Mon, Jan 16,

About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Community, I am struggling to save a DataFrame to a Hive table. Versions: Hive 1.2.1, Spark 2.0.1. *Working code:* /* @Author: Chetan Khatri Description: This Scala script was written for the HBase-to-Hive module, which reads a table from HBase and dumps it out to Hive */
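The script itself is truncated above; a minimal sketch of the usual Spark 2.0.x write path, with a placeholder DataFrame and table name rather than the HBase-to-Hive script's own:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("SaveToHiveSketch")        // placeholder name
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder data; in the script this DataFrame comes from HBase.
    val df = spark.range(0, 10).toDF("id")

    df.write
      .mode(SaveMode.Overwrite)
      .saveAsTable("default.sketch_table")   // placeholder table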

Re: Why are ml models repartition(1)'d in save methods?

2017-01-16 Thread Asher Krim
Cool, thanks! Jira: https://issues.apache.org/jira/browse/SPARK-19247 PR: https://github.com/apache/spark/pull/16607 I think the LDA model has the exact same issues - currently the `topicsMatrix` (which is on the order of numWords*k, 4 GB for numWords=3m and k=1000) is saved as a single element in a
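Not the actual MLlib save path, but an illustrative sketch of the alternative being argued for: spreading a large word-by-topic matrix over one row per word before writing, so no single task has to hold the whole matrix. Sizes and the output path are toy values, and a spark-shell session with `spark` in scope is assumed:

    import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}

    import spark.implicits._

    val k = 3                 // toy number of topics
    val numWords = 5          // toy vocabulary size
    val topics = new DenseMatrix(numWords, k, Array.fill(numWords * k)(0.1))

    // One row per word instead of one huge single-element row.
    val rows = (0 until numWords).map { w =>
      (w, Vectors.dense((0 until k).map(t => topics(w, t)).toArray))
    }
    rows.toDF("wordIndex", "topicWeights")
      .write.mode("overwrite").parquet("/tmp/topics_sketch")   // placeholder path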

Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Fei Hu
Hi Liang-Chi, Yes, the split logic is needed in compute(). The preferred locations can be derived from the customized Partition class. Thanks for your help! Cheers, Fei On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh wrote: > > Hi Fei, > > I think it should work. But you

Re: Both Spark AM and Client are trying to delete the Staging Directory

2017-01-16 Thread Rostyslav Sotnychenko
Thanks all! I was using another DFS instead of HDFS, which was logging an error when fs.delete got called on a non-existing path. In Spark 2.0.1, which I was using previously, everything was working fine because of an additional existence check that was made prior to deleting. However, that check got
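A sketch of the defensive pattern being described, checking for the staging directory before deleting it (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val stagingDir = new Path("/user/spark/.sparkStaging/application_0000")  // placeholder

    // Only delete when the directory still exists, so a DFS that treats
    // delete-on-a-missing-path as an error does not log spurious failures.
    if (fs.exists(stagingDir)) {
      fs.delete(stagingDir, true)   // recursive
    }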

Re: Spark support on Windows

2017-01-16 Thread Hyukjin Kwon
Hi, I just looked through Jacek's page and I believe that is the correct way. That seems to be a Hadoop-library-specific issue [1]. To my knowledge, winutils and the binaries in the private repo are built by a Hadoop PMC member on a dedicated Windows VM, which I believe makes them pretty trustworthy.

Spark support on Windows

2017-01-16 Thread assaf.mendelson
Hi, the documentation says Spark is supported on Windows. The problem, however, is that the documentation for Windows is lacking. There are sources (such as https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html and
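For reference, a minimal sketch of the workaround those guides describe: pointing Hadoop at a directory that contains bin\winutils.exe before starting a local session. The C:\hadoop path is an example, not a documented default:

    // Must point at a folder containing bin\winutils.exe (example path).
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WindowsSketch")    // placeholder name
      .getOrCreate()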

Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Liang-Chi Hsieh
Hi Fei, I think it should work. But you may need to add some logic in compute() to decide which half of the parent partition to output. And you need to get the correct preferred locations for the partitions sharing the same parent partition. Fei Hu wrote > Hi Liang-Chi, > > Yes, you
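A rough, untested sketch of that idea: wrap each parent partition in two child partitions, keep alternating elements in compute(), and delegate preferred locations to the parent. Class and field names here are made up for illustration:

    import org.apache.spark.{NarrowDependency, Partition, TaskContext}
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Each parent partition is wrapped twice; `half` selects even or odd elements.
    class HalfPartition(override val index: Int, val parent: Partition, val half: Int)
      extends Partition

    class SplitRDD[T: ClassTag](prev: RDD[T])
      extends RDD[T](prev.context, Seq(new NarrowDependency[T](prev) {
        // Child partition i reads from parent partition i / 2.
        override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
      })) {

      override def getPartitions: Array[Partition] =
        prev.partitions.flatMap { p =>
          Seq[Partition](new HalfPartition(2 * p.index, p, 0),
                         new HalfPartition(2 * p.index + 1, p, 1))
        }

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val hp = split.asInstanceOf[HalfPartition]
        // Keep alternating elements so the two halves are roughly equal in size.
        prev.iterator(hp.parent, context).zipWithIndex
          .collect { case (v, i) if i % 2 == hp.half => v }
      }

      // Both halves stay on the same node(s) as the parent partition.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        prev.preferredLocations(split.asInstanceOf[HalfPartition].parent)
    }

Note that each half re-reads the parent partition, so this only pays off when the parent is cached or cheap to recompute.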