spark hive branch location

2015-10-05 Thread weoccc
Hi, I would like to know where the Spark Hive GitHub repository that the Spark build depends on is located. I was told it used to be at https://github.com/pwendell/hive, but it seems it is no longer there. Thanks a lot, Weide

Re: spark hive branch location

2015-10-05 Thread Michael Armbrust
I think this is the most up-to-date branch (used in Spark 1.5): https://github.com/pwendell/hive/tree/release-1.2.1-spark

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Patrick Wendell
The missing artifacts are uploaded now. Things should propagate in the next 24 hours. If there are still issues after that, ping this thread. Thanks! - Patrick

HiveContext in standalone mode: shuffle hang ups

2015-10-05 Thread Saif.A.Ellafi
Hi all, I have a process that takes only 40 seconds in local mode, while the same process in standalone mode, with the node used for local mode as the only available node, takes forever: RDD actions hang up. I could only "sort this out" by turning speculation on, so the same task hanging is
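
For reference, speculative execution is controlled by the spark.speculation configuration key; a minimal sketch of enabling it when building the context (the app name is illustrative, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// With speculation on, Spark relaunches slow-running tasks on other
// executors and keeps whichever attempt finishes first.
val conf = new SparkConf()
  .setAppName("speculation-example") // illustrative name
  .set("spark.speculation", "true")  // off by default

val sc = new SparkContext(conf)
```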

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Blaž Šnuderl
Also missing is http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz, which breaks the spark-ec2 script. On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu wrote: > hadoop1 package for Scala 2.10 wasn't in RC1 either:

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Josh Rosen
I'm working on a fix for this right now. I'm planning to re-run a modified copy of the release packaging scripts which will emit only the missing artifacts (so we won't upload new artifacts with different SHAs for the builds which *did* succeed). I expect to have this finished in the next day or

Re: failure notice

2015-10-05 Thread Renyi Xiong
If RDDs from the same DStream are not guaranteed to run on the same worker, then the question becomes: is it possible to specify an unlimited duration in ssc to have a continuous stream (as opposed to a discretized one)? Say we have a per-node streaming engine (with built-in checkpoint and recovery) we'd like to
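
For context, a StreamingContext is always constructed with a finite batch interval; that interval is what slices the stream into the RDDs of a DStream, so there is no "unlimited" duration. A minimal sketch (the one-second interval is illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch duration is mandatory at construction time; each interval
// of data becomes one RDD in any DStream created from this context.
val conf = new SparkConf().setAppName("dstream-example")
val ssc = new StreamingContext(conf, Seconds(1))
```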

Re: failure notice

2015-10-05 Thread Tathagata Das
What happens when a whole node running your "per-node streaming engine (built-in checkpoint and recovery)" fails? Can its checkpoint and recovery mechanism handle whole-node failure? Can you recover from the checkpoint on a different node? Spark and Spark Streaming were designed with the idea

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Thanks for looking into this, Josh.

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
You can write the data to local HDFS (or local disk) and just load it from there.
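
A minimal sketch of that suggestion, assuming sc is an existing SparkContext (the paths are hypothetical):

```scala
// Stage the data on local HDFS once, then do subsequent passes from there
// instead of caching it in executor memory or re-reading from S3.
val data = sc.textFile("s3n://bucket/input/")       // hypothetical source path
data.saveAsTextFile("hdfs:///staging/input-copy/")  // hypothetical staging path

val local = sc.textFile("hdfs:///staging/input-copy/")
```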

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Jegan
I am sorry, I didn't understand it completely. Are you suggesting copying the files from S3 to HDFS? Actually, that is what I am doing: I am reading the files using Spark and persisting them locally. Or did you actually mean to ask the producer to write the files directly to HDFS instead of S3? I

Re: spark hive branch location

2015-10-05 Thread weoccc
Hi Michael, thanks for pointing me to the branch. What are the build instructions for building the hive 1.2.1 release branch for Spark 1.5? Weide

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
I meant to say just copy everything to a local HDFS, and then don't use caching...

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Ted Yu
As a workaround, can you set the number of partitions higher in the sc.textFile method? Cheers On Mon, Oct 5, 2015 at 3:31 PM, Jegan wrote: > Hi All, > > I am facing the below exception when the size of the file being read in a > partition is above 2GB. This is apparently
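
A minimal sketch of that workaround (the path and partition count are illustrative): more partitions means less data per partition, which is what matters here, since a disk block is read into a ByteBuffer whose size is capped at Integer.MAX_VALUE (roughly 2GB).

```scala
// Ask for more input splits so no single partition exceeds ~2GB.
// minPartitions is a lower-bound hint; Spark may create more splits.
val lines = sc.textFile("hdfs:///data/large-input/", minPartitions = 2000)

// Alternatively, repartition an existing RDD:
val repartitioned = lines.repartition(2000)
```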

Re: StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Davies Liu
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet?

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-05 Thread Russell Spitzer
That sounds fine to me; we already do the filtering, so populating that field would be pretty simple. On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust wrote: > We have to try and maintain binary compatibility here, so probably the > easiest thing to do here would be to

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Jegan
Thanks for your suggestion, Ted. Unfortunately, at this point I cannot go beyond 1000 partitions. I am writing this data to BigQuery, and it has a limit of 1000 load jobs per day for a table (they have some limits on this). I currently create one load job per partition. Is there any other

Re: Difference between a task and a job

2015-10-05 Thread Daniel Darabos
Actions trigger jobs. A job is made up of stages. A stage is made up of tasks. Executor threads execute tasks. Does that answer your question?
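
To make that hierarchy concrete, a small illustrative example (the path is hypothetical):

```scala
// One action -> one job. The shuffle introduced by reduceByKey splits the
// job into two stages, and each stage runs one task per partition.
val pairs = sc.textFile("hdfs:///data/words.txt") // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

val counts = pairs.reduceByKey(_ + _) // transformation: nothing runs yet

counts.count() // action: triggers a job with a map-side and a reduce-side stage
```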

StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Eugene Morozov
Hi, we're building our own framework on top of Spark, and we give users a pretty complex schema to work with. That requires us to build DataFrames ourselves: we transform business objects into rows and struct types, and use these two to create a DataFrame. Everything was fine until I started to
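
For reference, this is the kind of construction being described, assuming sc and sqlContext as in the shell (the field names are illustrative). The subject line suggests the failure mode: each Row must carry exactly one value per StructField, and a mismatch only surfaces at runtime.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The schema declares the columns; every Row must supply one value per
// StructField, in order. Rows are not validated when the DataFrame is
// created, so an arity mismatch fails later, during evaluation.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val rows = sc.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))
val df = sqlContext.createDataFrame(rows, schema)
```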

Difference between a task and a job

2015-10-05 Thread Guna Prasaad
What is the difference between a task and a job in Spark and Spark Streaming? Regards, Guna

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Yin Huai
Hello Ewan, Adding a JSON-specific option makes sense. Can you open a JIRA for this? Also, sending out a PR will be great. For JSONRelation, I think we can pass all user-specific options to it (see org.apache.spark.sql.execution.datasources.json.DefaultSource's createRelation) just like what we
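
For context, user options reach a data source through the DataFrameReader as a Map[String, String]; a hedged sketch of how such a JSON-specific option would be passed (the option name and path are hypothetical, not what the thread settled on):

```scala
// Every .option(key, value) ends up in the parameters map handed to the
// source's createRelation, where the relation can pick out its own keys.
val df = sqlContext.read
  .format("json")
  .option("inferPrimitivesAsString", "true") // hypothetical option name
  .load("hdfs:///data/events.json")          // hypothetical path
```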

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Blaž said: Also missing is http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz, which breaks the spark-ec2 script. This is the package I am referring to in my original email. Nick said: It appears that almost every version of Spark up to and including 1.5.0 has included a

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
Thanks Yin, I'll put together a JIRA and a PR tomorrow. Ewan

RE: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
I've done some digging today and, as a quick and ugly fix, altering the case statement of the JSON inferField function in InferSchema.scala https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala to have case
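
The archived message is truncated here. As a purely hypothetical illustration of the kind of change being described (not the actual patch), the primitive branches of such a match could be collapsed to StringType so that records which disagree on a field's primitive type can no longer produce a conflict:

```scala
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.core.JsonToken._
import org.apache.spark.sql.types.{DataType, NullType, StringType}

// Hypothetical sketch: map every primitive JSON token straight to
// StringType, so a field that is a number in one record and a string in
// another infers cleanly instead of conflicting.
def inferPrimitiveAsString(token: JsonToken): DataType = token match {
  case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT |
       VALUE_TRUE | VALUE_FALSE => StringType
  case VALUE_NULL => NullType
  case other => throw new IllegalStateException(s"Unhandled token: $other")
}
```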