Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hello Jacek,

Actually, Reynold is still the release manager and I am just sending this
message for him :) Sorry. I should have made it clear in my original email.

Thanks,

Yin

On Thu, Dec 29, 2016 at 10:58 AM, Jacek Laskowski  wrote:

> Hi Yan,
>
> I was surprised at first when I noticed rxin stepped back and a
> new release manager stepped in. Congrats on your first ANNOUNCE!
>
> I can only expect even more great stuff coming into Spark from the dev
> team now that Reynold has freed up some time 😉
>
> Can't wait to read the changes...
>
> Jacek
>
> On 29 Dec 2016 5:03 p.m., "Yin Huai"  wrote:
>
>> Hi all,
>>
>> Apache Spark 2.1.0 is the second release of the Spark 2.x line. This release
>> makes significant strides in the production readiness of Structured
>> Streaming, with added support for event time watermarks
>> <https://spark.apache.org/docs/2.1.0/structured-streaming-programming-guide.html#handling-late-data-and-watermarking>
>> and Kafka 0.10 support
>> <https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html>.
>> In addition, this release focuses more on usability, stability, and polish,
>> resolving over 1200 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 2.1.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes: https://spark.apache.org/releases/spark-release-2-1-0.html
>>
>> (note: If you see any issues with the release notes, webpage or published
>> artifacts, please contact me directly off-list)
>>
>>
>>


Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-29 Thread Yin Huai
Hello,

We spent some time preparing artifacts and changes to the website (including
the release notes). I just sent out the announcement. 2.1.0 is
officially released.

Thanks,

Yin

On Wed, Dec 28, 2016 at 12:42 PM, Justin Miller <
justin.mil...@protectwise.com> wrote:

> Interesting, because a bug that seemed to be fixed in 2.1.0-SNAPSHOT
> doesn't appear to be fixed in 2.1.0 stable (it centered around a
> null-pointer exception during code gen). It seems to be fixed in
> 2.1.1-SNAPSHOT, but I can try stable again.
>
> On Dec 28, 2016, at 1:38 PM, Mark Hamstra  wrote:
>
> A SNAPSHOT build is not a stable artifact, but rather floats to the top of
> commits that are intended for the next release.  So, 2.1.1-SNAPSHOT comes
> after the 2.1.0 release and contains whatever code had been committed to the
> branch-2.1 maintenance branch at the time the artifact was built and is,
> therefore, intended for the eventual 2.1.1 maintenance release.  Once a
> release is tagged and stable artifacts for it can be built, there is no
> purpose for a SNAPSHOT of that release -- e.g. there is no longer any
> purpose for a 2.1.0-SNAPSHOT release; if you want 2.1.0, then you should be
> using stable artifacts now, not SNAPSHOTs.
>
> The existence of a SNAPSHOT doesn't imply anything about the release date
> of the associated finished version.  Rather, it only indicates a name that
> is attached to all of the code that is currently intended for the
> associated release number.
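To make this concrete, a rough sbt sketch of the difference (illustrative build snippet only; the snapshots repository is the Apache one listed further down in this thread):

// Stable release: fixed artifact resolved from Maven Central.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"

// Moving snapshot of the next maintenance release: its contents change as
// commits land on branch-2.1, and it needs the Apache snapshots repository.
resolvers += "apache-snapshots" at "https://repository.apache.org/content/groups/snapshots/"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1-SNAPSHOT"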
>
> On Wed, Dec 28, 2016 at 3:09 PM, Justin Miller <
> justin.mil...@protectwise.com> wrote:
>
>> It looks like the jars for 2.1.0-SNAPSHOT are gone?
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.1.0-SNAPSHOT/
>>
>> Also:
>>
>> 2.1.0-SNAPSHOT/    Fri Dec 23 16:31:42 UTC 2016
>> 2.1.1-SNAPSHOT/    Wed Dec 28 20:01:10 UTC 2016
>> 2.2.0-SNAPSHOT/    Wed Dec 28 19:12:38 UTC 2016
>>
>> What's with 2.1.1-SNAPSHOT? Is that version about to be released as well?
>>
>> Thanks!
>> Justin
>>
>> On Dec 28, 2016, at 12:53 PM, Mark Hamstra 
>> wrote:
>>
>> The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0
>>
>> On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers  wrote:
>>
>>> seems like the artifacts are on maven central but the website is not yet
>>> updated.
>>>
>>> strangely the tag v2.1.0 is not yet available on github. i assume it's
>>> equal to v2.1.0-rc5
>>>
>>> On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller <
>>> justin.mil...@protectwise.com> wrote:
>>>
 I'm curious about this as well. Seems like the vote passed.

 > On Dec 23, 2016, at 2:00 AM, Aseem Bansal 
 wrote:
 >
 >


 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>>
>
>


[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all,

Apache Spark 2.1.0 is the second release of the Spark 2.x line. This release
makes significant strides in the production readiness of Structured
Streaming, with added support for event time watermarks
<https://spark.apache.org/docs/2.1.0/structured-streaming-programming-guide.html#handling-late-data-and-watermarking>
and Kafka 0.10 support
<https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html>.
In addition, this release focuses more on usability, stability, and polish,
resolving over 1200 tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 2.1.0, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-1-0.html

(note: If you see any issues with the release notes, webpage or published
artifacts, please contact me directly off-list)


Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Yin Huai
Hello Michael,

Thank you for reporting this issue. It will be fixed by
https://github.com/apache/spark/pull/16080.

Thanks,

Yin

On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao  wrote:

> Hi!
>
> Do you have a real Hive installation?
> Have you built Spark 2.1 & Spark 2.0 with Hive support (-Phive
> -Phive-thriftserver)?
>
> It seems that you use Spark's "default" Hive 1.2.1. Your metadata is
> stored in a local Derby DB which is visible to that particular Spark
> installation but not to all of them.
>
> On Wed, Nov 30, 2016 at 4:51 AM, Michael Allman 
> wrote:
>
>> This is not an issue with all tables created in Spark 2.1, though I'm not
>> sure why some work and some do not. I have found that a table created as
>> such
>>
>> sql("create table test stored as parquet as select 1")
>>
>> in Spark 2.1 cannot be read in previous versions of Spark.
>>
>> Michael
>>
>>
>> > On Nov 29, 2016, at 5:15 PM, Michael Allman 
>> wrote:
>> >
>> > Hello,
>> >
>> > When I try to read from a Hive table created by Spark 2.1 in Spark 2.0
>> or earlier, I get an error:
>> >
>> > java.lang.ClassNotFoundException: Failed to load class for data
>> source: hive.
>> >
>> > Is there a way to get previous versions of Spark to read tables written
>> with Spark 2.1?
>> >
>> > Cheers,
>> >
>> > Michael
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
No problem! Glad it helped!

On Thu, Jan 7, 2016 at 12:05 PM, Umesh Kacha  wrote:

> Hi Yin, thanks much your answer solved my problem. Really appreciate it!
>
> Regards
>
>
> On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai  wrote:
>
>> Hi, we made the change because the partitioning discovery logic was too
>> flexible and it introduced problems that were very confusing to users. To
>> make your case work, we have introduced a new data source option called
>> basePath. You can use
>>
>> DataFrame df = hiveContext.read().format("orc").option("basePath", "path/to/table/").load("path/to/table/entity=xyz")
>>
>> So, the partitioning discovery logic will understand that the base path
>> is path/to/table/ and your dataframe will have the column "entity".
>>
>> You can find the doc at the end of partitioning discovery section of the
>> sql programming guide (
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>> ).
>>
>> Thanks,
>>
>> Yin
>>
>> On Thu, Jan 7, 2016 at 7:34 AM, unk1102  wrote:
>>
>>> Hi, from Spark 1.6 onwards, as per this doc
>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery>,
>>> we can't load specific Hive partitions into a DataFrame.
>>>
>>> In Spark 1.5 the following used to work, and the resulting dataframe
>>> had the entity column:
>>>
>>> DataFrame df =
>>> hiveContext.read().format("orc").load("path/to/table/entity=xyz")
>>>
>>> But in Spark 1.6 the above does not work, and I have to give the base
>>> path like the following, which does not give me the entity column that I
>>> want in the DataFrame:
>>>
>>> DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
>>>
>>> How do I load a specific Hive partition into a dataframe? What was the
>>> reason for removing this feature? It was efficient; now the Spark 1.6
>>> code above loads all partitions, and if I filter for specific partitions
>>> it is not efficient: it hits memory and throws a GC error because
>>> thousands of partitions get loaded into memory rather than the specific
>>> one. Please guide.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-specific-Hive-partition-in-DataFrame-Spark-1-6-tp25904.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
Hi, we made the change because the partitioning discovery logic was too
flexible and it introduced problems that were very confusing to users. To
make your case work, we have introduced a new data source option called
basePath. You can use

DataFrame df = hiveContext.read().format("orc").option("basePath", "path/to/table/").load("path/to/table/entity=xyz")

So, the partitioning discovery logic will understand that the base path is
path/to/table/ and your dataframe will have the column "entity".

You can find the doc at the end of partitioning discovery section of the
sql programming guide (
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
).
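A minimal Scala sketch of the same idea (hypothetical paths; a spark-shell with a HiveContext named hiveContext is assumed):

// Point load() at one partition directory and tell partition discovery where
// the table root is via basePath, so "entity" is kept as a column.
val df = hiveContext.read
  .format("orc")
  .option("basePath", "path/to/table/")
  .load("path/to/table/entity=xyz")

df.printSchema()   // schema includes the partition column "entity"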

Thanks,

Yin

On Thu, Jan 7, 2016 at 7:34 AM, unk1102  wrote:

> Hi, from Spark 1.6 onwards, as per this doc
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery>,
> we can't load specific Hive partitions into a DataFrame.
>
> In Spark 1.5 the following used to work, and the resulting dataframe had
> the entity column:
>
> DataFrame df =
> hiveContext.read().format("orc").load("path/to/table/entity=xyz")
>
> But in Spark 1.6 the above does not work, and I have to give the base path
> like the following, which does not give me the entity column that I want
> in the DataFrame:
>
> DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
>
> How do I load a specific Hive partition into a dataframe? What was the
> reason for removing this feature? It was efficient; now the Spark 1.6 code
> above loads all partitions, and if I filter for specific partitions it is
> not efficient: it hits memory and throws a GC error because thousands of
> partitions get loaded into memory rather than the specific one. Please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-specific-Hive-partition-in-DataFrame-Spark-1-6-tp25904.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Yin Huai
Hi Eran,

Can you try 1.6? With the change in
https://github.com/apache/spark/pull/10288, the JSON data source will not throw
a runtime exception if there is any record that it cannot parse. Instead,
it will put the entire record into the "_corrupt_record" column.
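A small sketch of what that looks like in 1.6 (hypothetical path; assumes at least one malformed record so the _corrupt_record column appears in the inferred schema):

val df = sqlContext.read.json("/path/to/sample.json")
df.printSchema()   // malformed lines end up under _corrupt_record instead of failing the job
df.filter(df("_corrupt_record").isNotNull).show(5, false)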

Thanks,

Yin

On Sun, Dec 20, 2015 at 9:37 AM, Eran Witkon  wrote:

> Thanks for this!
> This was the problem...
>
> On Sun, 20 Dec 2015 at 18:49 Chris Fregly  wrote:
>
>> hey Eran, I run into this all the time with Json.
>>
>> the problem is likely that your Json is "too pretty" and extending beyond
>> a single line which trips up the Json reader.
>>
>> my solution is usually to de-pretty the Json - either manually or through
>> an ETL step - by stripping all white space before pointing my
>> DataFrame/JSON reader at the file.
>>
>> this tool is handy for one-off scenarios:  http://jsonviewer.stack.hu
>>
>> for streaming use cases, you'll want to have a light de-pretty ETL step
>> either within the Spark Streaming job after ingestion - or upstream using
>> something like a Flume interceptor, NiFi Processor (I love NiFi), or Kafka
>> transformation assuming those exist by now.
>>
>> a similar problem exists for XML, btw.  there's lots of wonky workarounds
>> for this that use MapPartitions and all kinds of craziness.  the best
>> option, in my opinion, is to just ETL/flatten the data to make the
>> DataFrame reader happy.
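A rough Scala sketch of that de-pretty step (one possible approach, not the exact tooling discussed; assumes Jackson, which ships with Spark, and that each file holds one JSON document):

import com.fasterxml.jackson.databind.ObjectMapper

// Read each pretty-printed file whole, parse it, and re-emit it as a single
// compact line so the line-oriented JSON reader can handle it.
val compactLines = sc.wholeTextFiles("/path/to/pretty/json/").map { case (_, text) =>
  val mapper = new ObjectMapper()
  mapper.writeValueAsString(mapper.readTree(text))
}
val df = sqlContext.read.json(compactLines)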
>>
>> On Dec 19, 2015, at 4:55 PM, Eran Witkon  wrote:
>>
>> Hi,
>> I tried the following code in spark-shell on spark1.5.2:
>>
>> val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")
>> df.count()
>>
>> 15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size =
>> 67108864 bytes, TID = 3
>> 15/12/19 23:49:40 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID
>> 3)
>> java.lang.RuntimeException: Failed to parse a value for data type
>> StructType() (current token: VALUE_STRING).
>> at scala.sys.package$.error(package.scala:27)
>> at
>> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:172)
>> at
>> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:251)
>> at
>> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:246)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:365)
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>>
>> Am I am doing something wrong?
>> Eran
>>
>>


Re: DataFrame equality does not working in 1.5.1

2015-11-05 Thread Yin Huai
Can you attach the result of eventDF.filter($"entityType" ===
"user").select("entityId").distinct.explain(true)?

Thanks,

Yin

On Thu, Nov 5, 2015 at 1:12 AM, 千成徳  wrote:

> Hi All,
>
> I have data frame like this.
>
> Equality expression is not working in 1.5.1 but, works as expected in 1.4.0
> What is the difference?
>
> scala> eventDF.printSchema()
> root
>  |-- id: string (nullable = true)
>  |-- event: string (nullable = true)
>  |-- entityType: string (nullable = true)
>  |-- entityId: string (nullable = true)
>  |-- targetEntityType: string (nullable = true)
>  |-- targetEntityId: string (nullable = true)
>  |-- properties: string (nullable = true)
>
> scala> eventDF.groupBy("entityType").agg(countDistinct("entityId")).show
> +----------+------------------------+
> |entityType|COUNT(DISTINCT entityId)|
> +----------+------------------------+
> |   ib_user|                    4751|
> |      user|                    2091|
> +----------+------------------------+
>
>
> - does not work (bug?)
> scala> eventDF.filter($"entityType" ===
> "user").select("entityId").distinct.count
> res151: Long = 1219
>
> scala> eventDF.filter(eventDF("entityType") ===
> "user").select("entityId").distinct.count
> res153: Long = 1219
>
> scala> eventDF.filter($"entityType" equalTo
> "user").select("entityId").distinct.count
> res149: Long = 1219
>
> - works as expected
> scala> eventDF.map{ e => (e.getAs[String]("entityId"),
> e.getAs[String]("entityType")) }.filter(x => x._2 ==
> "user").map(_._1).distinct.count
> res150: Long = 2091
>
> scala> eventDF.filter($"entityType" in
> "user").select("entityId").distinct.count
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> res155: Long = 2091
>
> scala> eventDF.filter($"entityType" !==
> "ib_user").select("entityId").distinct.count
> res152: Long = 2091
>
>
> But, All of above code works in 1.4.0
>
> Thanks.
>
>


Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Yin Huai
Hi Ross,

What version of Spark are you using? There were two issues that affected
the results of window functions in the Spark 1.5 branch. Both issues have
been fixed and will be released with Spark 1.5.2 (this release will happen
soon). For more details on these two issues, you can take a look at
https://issues.apache.org/jira/browse/SPARK-11135 and
https://issues.apache.org/jira/browse/SPARK-11009.

Thanks,

Yin

On Mon, Nov 2, 2015 at 12:07 PM,  wrote:

> Hello Spark community -
> I am running a Spark SQL query to calculate the difference in time between
> consecutive events, using lag(event_time) over window -
>
> SELECT device_id,
>unix_time,
>event_id,
>unix_time - lag(unix_time)
>   OVER
> (PARTITION BY device_id ORDER BY unix_time,event_id)
>  AS seconds_since_last_event
> FROM ios_d_events;
>
> This is giving me some strange results in the case where the first two
> events for a particular device_id have the same timestamp.
> I used the following query to take a look at what value was being returned
> by lag():
>
> SELECT device_id,
>event_time,
>unix_time,
>event_id,
>lag(event_time) OVER (PARTITION BY device_id ORDER BY 
> unix_time,event_id) AS lag_time
> FROM ios_d_events;
>
> I’m seeing that in these cases, I am getting something like 1970-01-03 …
> instead of a null value, and the following lag times are all following the
> same format.
>
> I posted a section of this output in this SO question:
> http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls
>
> The errant results are labeled with device_id 999.
>
> Any idea why this is occurring?
>
> - Ross
>


Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread Yin Huai
@filthysocks, can you get the output of jmap -histo before the OOM (
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jmap.html)?

On Mon, Oct 26, 2015 at 6:35 AM, filthysocks  wrote:

> We upgraded from 1.4.1 to 1.5 and it's a pain
> see
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-1-driver-memory-problems-while-doing-Cross-Validation-do-not-occur-with-1-4-1-td25076.html
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Anyone-feels-sparkSQL-in-spark1-5-1-very-slow-tp25154p25204.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Yin Huai
btw, what version of Spark did you use?

On Mon, Oct 19, 2015 at 1:08 PM, YaoPau  wrote:

> I've connected Spark SQL to the Hive Metastore and currently I'm running
> SQL
> code via pyspark.  Typically everything works fine, but sometimes after a
> long-running Spark SQL job I get the error below, and from then on I can no
> longer run Spark SQL commands.  I still do have both my sc and my sqlCtx.
>
> Any idea what this could mean?
>
> An error occurred while calling o36.sql.
> : org.apache.spark.sql.AnalysisException: Conf non-local session path
> expected to be non-null;
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:260)
> at
>
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> at
>
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at
>
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at
>
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
> at
> org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:139)
> at
> org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:139)
> at
>
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
> at
>
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at
>
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at
>
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:235)
> at
>
> org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
> at
>
> org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
> at scala.Option.getOrElse(Option.scala:12

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
ldLeft(List.scala:84)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:59)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:51)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:51)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC.<init>(<console>:39)
> at $iwC.<init>(<console>:41)
> at <init>(<console>:43)
> at .<init>(<console>:47)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> On Thu, Oct 1, 2015 at 12:27 PM, Yin Huai  wrote:
>
>> Hi Pala,
>>
>> Can you add the full stack trace of the exception? For now, can you use
>> create temporary function to work around the issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com.invalid> wrote:
>>
>>> +user list
>>>
>>> On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia <
>>> mchett...@rocketfuelinc.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to use internal UDFs that we have added as permanent
>>>> functions to Hive, from within a Spark SQL query (using HiveContext), but I
>>>> encounter NoSuchObjectException, i.e. the function could not be found.
>>>>
>>>> However, if I execute the 'show functions' command in Spark SQL, the
>>>> permanent functions appear in the list.
>>>>
>>>> I am using Spark 1.4.1 with Hive 0.13.1. I tried to debug this by
>>>> looking at the log and code. It seems the show functions command and the
>>>> UDF query go through essentially the same code path, yet the former can
>>>> see the UDF but the latter can't.
>>>>
>>>> Any ideas on how to debug/fix this?
>>>>
>>>>
>>>> Thanks,
>>>> pala
>>>>
>>>
>>>
>>
>


Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Hi Pala,

Can you add the full stack trace of the exception? For now, can you use
create temporary function to work around the issue?
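A rough sketch of that workaround (hypothetical function name and class; assumes a HiveContext and that the UDF jar is already on the classpath):

// Register the UDF for this session only, then use it like any other function.
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.mycompany.hive.MyUdf'")
hiveContext.sql("SELECT my_udf(some_column) FROM some_table").show()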

Thanks,

Yin

On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia <
mchett...@rocketfuelinc.com.invalid> wrote:

> +user list
>
> On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> I am trying to use internal UDFs that we have added as permanent
>> functions to Hive, from within a Spark SQL query (using HiveContext), but I
>> encounter NoSuchObjectException, i.e. the function could not be found.
>>
>> However, if I execute the 'show functions' command in Spark SQL, the
>> permanent functions appear in the list.
>>
>> I am using Spark 1.4.1 with Hive 0.13.1. I tried to debug this by looking
>> at the log and code. It seems the show functions command and the UDF query
>> go through essentially the same code path, yet the former
>> can see the UDF but the latter can't.
>>
>> Any ideas on how to debug/fix this?
>>
>>
>> Thanks,
>> pala
>>
>
>


Re: Generic DataType in UDAF

2015-09-25 Thread Yin Huai
Hi Ritesh,

Right now, we only allow specific data types defined in the inputSchema.
Supporting abstract types (e.g. NumericType) may make the logic of a UDAF
more complex. It will be great to understand the use cases first. What
kinds of input data types do you want to support, and do you need
to know the actual argument types to determine how to process input data?

btw, for now, one possible workaround is to define multiple UDAFs for
different input types. Then, based on arguments that you have, you invoke
the corresponding UDAF.
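A minimal sketch of that dispatch idea (only the selection logic is shown; the per-type UDAF instances are assumed to be defined elsewhere):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types.DataType

// Pick the UDAF whose inputSchema matches the actual type of the column,
// then apply it; fail fast if no variant was registered for that type.
def aggregateWithTypedUdaf(
    df: DataFrame,
    colName: String,
    udafsByType: Map[DataType, UserDefinedAggregateFunction]): DataFrame = {
  val actualType = df.schema(colName).dataType
  val udaf = udafsByType.getOrElse(actualType,
    sys.error(s"No UDAF registered for input type $actualType"))
  df.agg(udaf(df(colName)))
}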

Thanks,

Yin

On Fri, Sep 25, 2015 at 8:07 AM, Ritesh Agrawal <
ragra...@netflix.com.invalid> wrote:

> Hi all,
>
> I am trying to learn about UDAF and implemented a simple reservoir sample
> UDAF. It's working fine. However, I am not able to figure out what DataType
> I should use so that it can deal with all DataTypes (simple and complex).
> For instance currently I have defined my input schema as
>
>  def inputSchema = StructType(StructField("id", StringType) :: Nil )
>
>
> Instead of StringType can I use some other data type that is superclass of
> all the DataTypes ?
>
> Ritesh
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is that df.rdd does not work very well with limit. In
Scala, df.limit(1).rdd will also trigger the issue you observed. I will add
this to the JIRA.

On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam  wrote:

> I just noticed you found 1.4 has the same issue. I added that as well in
> the ticket.
>
> On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam  wrote:
>
>> Hi Yin,
>>
>> You are right! I just tried the scala version with the above lines, it
>> works as expected.
>> I'm not sure if it happens also in 1.4 for pyspark but I thought the
>> pyspark code just calls the scala code via py4j. I didn't expect that this
>> bug is pyspark specific. That surprises me actually a bit. I created a
>> ticket for this (SPARK-10731
>> <https://issues.apache.org/jira/browse/SPARK-10731>).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai  wrote:
>>
>>> btw, does 1.4 have the same problem?
>>>
>>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:
>>>
>>>> Hi Jerry,
>>>>
>>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>>
>>>> Thanks,
>>>>
>>>> Yin
>>>>
>>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam 
>>>> wrote:
>>>>
>>>>> Hi Spark Developers,
>>>>>
>>>>> I just ran some very simple operations on a dataset. I was surprised by
>>>>> the execution plan of take(1), head() or first().
>>>>>
>>>>> For your reference, this is what I did in pyspark 1.5:
>>>>> df=sqlContext.read.parquet("someparquetfiles")
>>>>> df.head()
>>>>>
>>>>> The above lines take over 15 minutes. I was frustrated because I can
>>>>> do better without using Spark :) Since I like Spark, I tried to figure
>>>>> out why. It seems the dataframe requires 3 stages to give me the first
>>>>> row. It reads all data (which is about 1 billion rows) and runs Limit
>>>>> twice.
>>>>>
>>>>> Instead of head(), show(1) runs much faster. Not to mention that if I
>>>>> do:
>>>>>
>>>>> df.rdd.take(1) //runs much faster.
>>>>>
>>>>> Is this expected? Why is head/first/take so slow for a dataframe? Is it
>>>>> a bug in the optimizer? Or did I do something wrong?
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue.

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:

> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset. I was surprised by
>>> the execution plan of take(1), head() or first().
>>>
>>> For your reference, this is what I did in pyspark 1.5:
>>> df=sqlContext.read.parquet("someparquetfiles")
>>> df.head()
>>>
>>> The above lines take over 15 minutes. I was frustrated because I can do
>>> better without using Spark :) Since I like Spark, I tried to figure out
>>> why. It seems the dataframe requires 3 stages to give me the first row. It
>>> reads all data (which is about 1 billion rows) and runs Limit twice.
>>>
>>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>>
>>> df.rdd.take(1) //runs much faster.
>>>
>>> Is this expected? Why is head/first/take so slow for a dataframe? Is it a
>>> bug in the optimizer? Or did I do something wrong?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 have the same problem?

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:

> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:
>
>> Hi Spark Developers,
>>
>> I just ran some very simple operations on a dataset. I was surprised by
>> the execution plan of take(1), head() or first().
>>
>> For your reference, this is what I did in pyspark 1.5:
>> df=sqlContext.read.parquet("someparquetfiles")
>> df.head()
>>
>> The above lines take over 15 minutes. I was frustrated because I can do
>> better without using Spark :) Since I like Spark, I tried to figure out
>> why. It seems the dataframe requires 3 stages to give me the first row. It
>> reads all data (which is about 1 billion rows) and runs Limit twice.
>>
>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>
>> df.rdd.take(1) //runs much faster.
>>
>> Is this expected? Why is head/first/take so slow for a dataframe? Is it a
>> bug in the optimizer? Or did I do something wrong?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry,

Looks like it is a Python-specific issue. Can you create a JIRA?

Thanks,

Yin

On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:

> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprised by the
> execution plan of take(1), head() or first().
>
> For your reference, this is what I did in pyspark 1.5:
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
>
> The above lines take over 15 minutes. I was frustrated because I can do
> better without using Spark :) Since I like Spark, I tried to figure out
> why. It seems the dataframe requires 3 stages to give me the first row. It
> reads all data (which is about 1 billion rows) and runs Limit twice.
>
> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>
> df.rdd.take(1) //runs much faster.
>
> Is this expected? Why is head/first/take so slow for a dataframe? Is it a
> bug in the optimizer? Or did I do something wrong?
>
> Best Regards,
>
> Jerry
>


Re: Null Value in DecimalType column of DataFrame

2015-09-17 Thread Yin Huai
As I mentioned before, the range of values of DecimalType(10, 10) is [0,
1). If you have a value 10.5 and you want to cast it to DecimalType(10,
10), I do not think there is any better return value than null.
Looks like DecimalType(10, 10) is not the right type for your use case. You
need a decimal type that has precision - scale >= 2.
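A small spark-shell sketch of that arithmetic (assumes sqlContext is available, Spark 1.5-era API):

import sqlContext.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val df = Seq("10.5", "0.1").toDF("s")
// DecimalType(10, 10) leaves no digits to the left of the decimal point, so
// "10.5" cannot be represented and the cast yields null; DecimalType(12, 10)
// has precision - scale = 2 and keeps both values.
df.select(
  col("s").cast(DecimalType(10, 10)).as("d_10_10"),
  col("s").cast(DecimalType(12, 10)).as("d_12_10")
).show()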

On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

>
> Hi Yin, posted here because I think it's a bug.
> So, it will return null and I can get a NullPointerException, as I was
> getting. Is this really the expected behavior? I have never seen something
> return null in other Scala tools that I have used.
>
> Regards,
>
>
> 2015-09-14 18:54 GMT-03:00 Yin Huai :
>
>> btw, move it to user list.
>>
>> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai  wrote:
>>
>>> A scale of 10 means that there are 10 digits at the right of the decimal
>>> point. If you also have precision 10, the range of your data will be [0, 1)
>>> and casting "10.5" to DecimalType(10, 10) will return null, which is
>>> expected.
>>>
>>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>>> dirceu.semigh...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>> I'm moving from spark 1.4 to 1.5, and one of my tests is failing.
>>>> It seems that there was some changes in org.apache.spark.sql.types.
>>>> DecimalType
>>>>
>>>> This ugly code is a little sample to reproduce the error; don't use it
>>>> in your project.
>>>>
>>>> test("spark test") {
>>>>   val file = 
>>>> context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => 
>>>> Row.fromSeq({
>>>> val values = f.split(",")
>>>> 
>>>> Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>>> values.tail.tail.tail.head)}))
>>>>
>>>>   val structType = StructType(Seq(StructField("id", IntegerType, false),
>>>> StructField("int2", IntegerType, false), StructField("double",
>>>>
>>>>  DecimalType(10,10), false),
>>>>
>>>>
>>>> StructField("str2", StringType, false)))
>>>>
>>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>>   df.first
>>>> }
>>>>
>>>> The content of the file is:
>>>>
>>>> 1,5,10.5,va
>>>> 2,1,0.1,vb
>>>> 3,8,10.0,vc
>>>>
>>>> The problem resides in DecimalType; before 1.5 the scale wasn't
>>>> required. Now when using DecimalType(12,10) it works fine, but using
>>>> DecimalType(10,10) the Decimal value
>>>> 10.5 becomes null, while 0.1 works.
>>>>
>>>> Is there anybody working with DecimalType for 1.5.1?
>>>>
>>>> Regards,
>>>> Dirceu
>>>>
>>>>
>>>
>>
>
>


Re: Null Value in DecimalType column of DataFrame

2015-09-14 Thread Yin Huai
btw, move it to user list.

On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai  wrote:

> A scale of 10 means that there are 10 digits at the right of the decimal
> point. If you also have precision 10, the range of your data will be [0, 1)
> and casting "10.5" to DecimalType(10, 10) will return null, which is
> expected.
>
> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi all,
>> I'm moving from spark 1.4 to 1.5, and one of my tests is failing.
>> It seems that there were some changes in org.apache.spark.sql.types.
>> DecimalType
>>
>> This ugly code is a little sample to reproduce the error; don't use it
>> in your project.
>>
>> test("spark test") {
>>   val file = 
>> context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => 
>> Row.fromSeq({
>> val values = f.split(",")
>> 
>> Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>> values.tail.tail.tail.head)}))
>>
>>   val structType = StructType(Seq(StructField("id", IntegerType, false),
>> StructField("int2", IntegerType, false), StructField("double",
>>
>>  DecimalType(10,10), false),
>>
>>
>> StructField("str2", StringType, false)))
>>
>>   val df = context.sqlContext.createDataFrame(file,structType)
>>   df.first
>> }
>>
>> The content of the file is:
>>
>> 1,5,10.5,va
>> 2,1,0.1,vb
>> 3,8,10.0,vc
>>
>> The problem resides in DecimalType; before 1.5 the scale wasn't required.
>> Now when using DecimalType(12,10) it works fine, but using
>> DecimalType(10,10) the Decimal value
>> 10.5 becomes null, while 0.1 works.
>>
>> Is there anybody working with DecimalType for 1.5.1?
>>
>> Regards,
>> Dirceu
>>
>>
>


Re: java.util.NoSuchElementException: key not found

2015-09-11 Thread Yin Huai
Looks like you hit https://issues.apache.org/jira/browse/SPARK-10422; it
has been fixed in branch-1.5, and the 1.5.1 release will have it.

On Fri, Sep 11, 2015 at 3:35 AM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:

> Hi all,
> After upgrading Spark to 1.5, Streaming occasionally throws
> java.util.NoSuchElementException: key not found. Could the data itself be
> causing this error? Please help if anyone has hit a similar problem
> before. Thanks very much.
>
> The exception occurs when writing into the database.
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 
> (TID 76, slave2): java.util.NoSuchElementException: key not found: 
> ruixue.sys.session.request
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>
> at 
> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
>
> at 
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
>
> at 
> org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>
> at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>
> at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> --
> guoqing0...@yahoo.com.hk
>


Re: How to evaluate custom UDF over window

2015-08-24 Thread Yin Huai
For now, user-defined window functions are not supported. We will add
support in the future.

On Mon, Aug 24, 2015 at 6:26 AM, xander92 
wrote:

> The ultimate aim of my program is to be able to wrap an arbitrary Scala
> function (mostly statistics / customized rolling-window metrics) in
> a UDF and evaluate it on DataFrames using the window functionality.
>
> So my main question is how do I express that a UDF takes a Frame of rows
> from a DataFrame as an argument instead of just a single row? And what sort
> of arguments would the arbitrary Scala function need to take in order to
> handle the raw data from the Frame?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-evaluate-custom-UDF-over-window-tp24419.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SQLContext Create Table Problem

2015-08-19 Thread Yin Huai
Can you try to use HiveContext instead of SQLContext? Your query is trying
to create a table and persist the metadata of the table in the metastore,
which is only supported by HiveContext.
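A minimal sketch of the suggested change (same DDL as quoted below, only the context construction differs):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext()
// HiveContext (backed by a metastore) understands CREATE TABLE ... STORED AS ...;
// the plain SQLContext parser does not, which is what the error below reports.
val sqlContext = new HiveContext(sc)

sqlContext.sql("""
  create table if not exists landing (
    date string,
    referrer string
  )
  partitioned by (partnerid string, dt string)
  row format delimited fields terminated by '\t' lines terminated by '\n'
  stored as textfile location 's3n://...'
""")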

On Wed, Aug 19, 2015 at 8:44 AM, Yusuf Can Gürkan 
wrote:

> Hello,
>
> I’m trying to create a table with sqlContext.sql method as below:
>
> val sc = new SparkContext()
> val sqlContext = new SQLContext(sc)
>
> import sqlContext.implicits._
>
> sqlContext.sql(s"""
> create table if not exists landing (
> date string,
> referrer string
> )
> partitioned by (partnerid string,dt string)
> row format delimited fields terminated by '\t' lines terminated by '\n'
> STORED AS TEXTFILE LOCATION 's3n://...'
>   """)
>
>
> It gives error on spark-submit:
>
> Exception in thread "main" java.lang.RuntimeException: [2.1] failure:
> ``with'' expected but identifier create found
>
> create external table if not exists landing (
>
> ^
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
> at
> org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
>
>
>
> What can be the reason??
>


Re: About Databricks's spark-sql-perf

2015-08-13 Thread Yin Huai
Hi Todd,

We have not had a chance to update it yet. We will update it after the 1.5 release.

Thanks,

Yin

On Thu, Aug 13, 2015 at 6:49 AM, Todd  wrote:

> Hi,
> I got a question about the spark-sql-perf project by Databricks at 
> https://github.com/databricks/spark-sql-perf/
>
>
> The Tables.scala (
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/Tables.scala)
> and BigData (
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/BigData.scala)
> are empty files.
> Is this intentional, or is it a bug?
> Also, the following code snippet from the README.md won't compile, as there
> is no Tables class defined in the org.apache.spark.sql.parquet package:
> (I am using Spark 1.4.1; is the code compatible with Spark 1.4.1?)
>
> import org.apache.spark.sql.parquet.Tables
> // Tables in TPC-DS benchmark used by experiments.
> val tables = Tables(sqlContext)
> // Setup TPC-DS experiment
> val tpcds = new TPCDS (sqlContext = sqlContext)
>
>
>


Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
No problem:) Glad to hear that!

On Thu, Jul 16, 2015 at 8:22 PM, Koert Kuipers  wrote:

> that solved it, thanks!
>
> On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers  wrote:
>
>> thanks i will try 1.4.1
>>
>> On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai  wrote:
>>
>>> Hi Koert,
>>>
>>> For the classloader issue, you probably hit
>>> https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed
>>> in Spark 1.4.1. Can you try 1.4.1 and see if the exception disappear?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers 
>>> wrote:
>>>
>>>> i am using scala 2.11
>>>>
>>>> spark jars are not in my assembly jar (they are "provided"), since i
>>>> launch with spark-submit
>>>>
>>>> On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers 
>>>> wrote:
>>>>
>>>>> spark 1.4.0
>>>>>
>>>>> spark-csv is a normal dependency of my project and in the assembly jar
>>>>> that i use
>>>>>
>>>>> but i also tried adding spark-csv with --package for spark-submit, and
>>>>> got the same error
>>>>>
>>>>> On Thu, Jul 16, 2015 at 4:31 PM, Yin Huai 
>>>>> wrote:
>>>>>
>>>>>> We do this in SparkILoop (
>>>>>> https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037).
>>>>>> What is the version of Spark you are using? How did you add the spark-csv
>>>>>> jar?
>>>>>>
>>>>>> On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers 
>>>>>> wrote:
>>>>>>
>>>>>>> has anyone tried to make HiveContext only if the class is available?
>>>>>>>
>>>>>>> i tried this:
>>>>>>>  implicit lazy val sqlc: SQLContext = try {
>>>>>>> Class.forName("org.apache.spark.sql.hive.HiveContext", true,
>>>>>>> Thread.currentThread.getContextClassLoader)
>>>>>>>
>>>>>>> .getConstructor(classOf[SparkContext]).newInstance(sc).asInstanceOf[SQLContext]
>>>>>>>   } catch { case e: ClassNotFoundException => new SQLContext(sc) }
>>>>>>>
>>>>>>> it compiles fine, but i get classloader issues when i actually use
>>>>>>> it on a cluster. for example:
>>>>>>>
>>>>>>> Exception in thread "main" java.lang.RuntimeException: Failed to
>>>>>>> load class for data source: com.databricks.spark.csv
>>>>>>> at scala.sys.package$.error(package.scala:27)
>>>>>>> at
>>>>>>> org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
>>>>>>> at
>>>>>>> org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
Hi Koert,

For the classloader issue, you probably hit
https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed in
Spark 1.4.1. Can you try 1.4.1 and see if the exception disappear?

Thanks,

Yin

On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers  wrote:

> i am using scala 2.11
>
> spark jars are not in my assembly jar (they are "provided"), since i
> launch with spark-submit
>
> On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers  wrote:
>
>> spark 1.4.0
>>
>> spark-csv is a normal dependency of my project and in the assembly jar
>> that i use
>>
>> but i also tried adding spark-csv with --package for spark-submit, and
>> got the same error
>>
>> On Thu, Jul 16, 2015 at 4:31 PM, Yin Huai  wrote:
>>
>>> We do this in SparkILoop (
>>> https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037).
>>> What is the version of Spark you are using? How did you add the spark-csv
>>> jar?
>>>
>>> On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers 
>>> wrote:
>>>
>>>> has anyone tried to make HiveContext only if the class is available?
>>>>
>>>> i tried this:
>>>>  implicit lazy val sqlc: SQLContext = try {
>>>> Class.forName("org.apache.spark.sql.hive.HiveContext", true,
>>>> Thread.currentThread.getContextClassLoader)
>>>>
>>>> .getConstructor(classOf[SparkContext]).newInstance(sc).asInstanceOf[SQLContext]
>>>>   } catch { case e: ClassNotFoundException => new SQLContext(sc) }
>>>>
>>>> it compiles fine, but i get classloader issues when i actually use it
>>>> on a cluster. for example:
>>>>
>>>> Exception in thread "main" java.lang.RuntimeException: Failed to load
>>>> class for data source: com.databricks.spark.csv
>>>> at scala.sys.package$.error(package.scala:27)
>>>> at
>>>> org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
>>>> at
>>>> org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
>>>> at
>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>>>> at
>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
>>>>
>>>>
>>>
>>
>


Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
We do this in SparkILoop (
https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037).
What is the version of Spark you are using? How did you add the spark-csv
jar?

On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers  wrote:

> has anyone tried to make HiveContext only if the class is available?
>
> i tried this:
>  implicit lazy val sqlc: SQLContext = try {
> Class.forName("org.apache.spark.sql.hive.HiveContext", true,
> Thread.currentThread.getContextClassLoader)
>
> .getConstructor(classOf[SparkContext]).newInstance(sc).asInstanceOf[SQLContext]
>   } catch { case e: ClassNotFoundException => new SQLContext(sc) }
>
> it compiles fine, but i get classloader issues when i actually use it on a
> cluster. for example:
>
> Exception in thread "main" java.lang.RuntimeException: Failed to load
> class for data source: com.databricks.spark.csv
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
> at
> org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
>
>


Re: [SPARK-SQL] Window Functions optimization

2015-07-13 Thread Yin Huai
Your query will be partitioned once. Then, a single Window operator will
evaluate these three functions. As mentioned by Harish, you can take a look
at the plan (sql("your sql...").explain()).
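A small sketch of how to check this yourself (hypothetical table name; in 1.4/1.5 window functions require a HiveContext):

val df = sqlContext.sql("""
  select key,
         max(value1) over (partition by key) as m1,
         max(value2) over (partition by key) as m2,
         max(value3) over (partition by key) as m3
  from events
""")
println(df.queryExecution.optimizedPlan)  // one repartition/sort for the shared partition spec
df.explain()                              // physical plan shows a single Window operator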

On Mon, Jul 13, 2015 at 12:26 PM, Harish Butani 
wrote:

> Just once.
> You can see this by printing the optimized logical plan.
> You will see just one repartition operation.
>
> So do:
> val df = sql("your sql...")
> println(df.queryExecution.analyzed)
>
> On Mon, Jul 13, 2015 at 6:37 AM, Hao Ren  wrote:
>
>> Hi,
>>
>> I would like to know: has any optimization been done for window
>> functions in Spark SQL?
>>
>> For example.
>>
>> select key,
>> max(value1) over(partition by key) as m1,
>> max(value2) over(partition by key) as m2,
>> max(value3) over(partition by key) as m3
>> from table
>>
>> The query above creates 3 fields based on the same partition rule.
>>
>> The question is:
>> Will spark-sql partition the table 3 times in the same way to get the
>> three
>> max values ? or just partition once if it finds the partition rule is the
>> same ?
>>
>> It would be nice if someone could point out some lines of code on it.
>>
>> Thank you.
>> Hao
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Window-Functions-optimization-tp23796.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Yin Huai
Jerrick,

Let me ask a few clarification questions. What is the version of Spark? Is
the table a hive table? What is the format of the table? Is the table
partitioned?

Thanks,

Yin

On Sun, Jul 12, 2015 at 6:01 PM, ayan guha  wrote:

> Describe computes statistics, so it will try to query the table. The one
> you are looking for is df.printSchema()
>
> On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang 
> wrote:
>
>> Hi all,
>>
>> I'm new to Spark and this question may be trivial or already answered,
>> but when I do a 'describe table' from the Spark SQL CLI it seems to
>> try looking at all records in the table (which takes a really long time for
>> a big table) instead of just giving me the metadata of the table. Would
>> appreciate if someone can give me some pointers, thanks!
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Yin Huai
Hi Brandon,

Can you explain what you meant by "It simply does not work"? Did you not see
any new data files?

Thanks,

Yin

On Fri, Jul 10, 2015 at 11:55 AM, Brandon White 
wrote:

> Why does this not work? Is insert into broken in 1.3.1? It does not throw
> any errors, fail, or throw exceptions. It simply does not work.
>
>
> val ssc = new StreamingContext(sc, Minutes(10))
>
> val currentStream = ssc.textFileStream(s"s3://textFileDirectory/")
> val dayBefore = sqlContext.jsonFile(s"s3://textFileDirectory/")
>
> dayBefore.saveAsParquetFile("/tmp/cache/dayBefore.parquet")
> val parquetFile = sqlContext.parquetFile("/tmp/cache/dayBefore.parquet")
> parquetFile.registerTempTable("rideaccepted")
>
> currentStream.foreachRDD { rdd =>
>   val df = sqlContext.jsonRDD(rdd)
>   df.insertInto("rideaccepted")
> }
>
> ssc.start()
>
>
> Or this?
>
> val ssc = new StreamingContext(sc, Minutes(10))
> val currentStream = ssc.textFileStream("s3://textFileDirectory")
> val day = sqlContext.jsonFile("s3://textFileDirectory")
> day.registerTempTable("rideaccepted")
>
>
> currentStream.foreachRDD { rdd =>
>   val df = sqlContext.jsonRDD(rdd)
>   df.registerTempTable("tmp_rideaccepted")
>   sqlContext.sql("insert into table rideaccepted select * from 
> tmp_rideaccepted")
> }
>
> ssc.start()
>
>
> or this?
>
> val ssc = new StreamingContext(sc, Minutes(10))
>
> val currentStream = ssc.textFileStream(s"s3://textFileDirectory/")
> val dayBefore = sqlContext.jsonFile(s"s3://textFileDirectory/")
>
> dayBefore..registerTempTable("rideaccepted")
>
> currentStream.foreachRDD { rdd =>
>   val df = sqlContext.jsonRDD(rdd)
>   df.insertInto("rideaccepted")
> }
>
> ssc.start()
>
>


Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
You meant "SPARK_REPL_OPTS"? I did a quick search. Looks like it has been
removed since 1.0. I think it did not affect the behavior of the shell.

On Mon, Jul 6, 2015 at 9:04 AM, Simeon Simeonov  wrote:

>   Yin, that did the trick.
>
>  I'm curious what was the effect of the environment variable, however, as
> the behavior of the shell changed from hanging to quitting when the env var
> value got to 1g.
>
>  /Sim
>
>  Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
> @simeons <http://twitter.com/simeons> | blog.simeonov.com | 617.299.6746
>
>
>   From: Yin Huai 
> Date: Monday, July 6, 2015 at 11:41 AM
> To: Denny Lee 
> Cc: Simeon Simeonov , Andy Huang ,
> user 
>
> Subject: Re: 1.4.0 regression: out-of-memory errors on small data
>
>   Hi Sim,
>
>  I think the right way to set the PermGen Size is through driver extra
> JVM options, i.e.
>
>  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m"
>
>  Can you try it? Without this conf, your driver's PermGen size is still
> 128m.
>
>  Thanks,
>
>  Yin
>
> On Mon, Jul 6, 2015 at 4:07 AM, Denny Lee  wrote:
>
>>  I went ahead and tested your file and the results from the tests can be
>> seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.
>>
>>  Basically, when running {Java 7, MaxPermSize = 256} or {Java 8,
>> default} the query ran without any issues.  I was able to recreate the
>> issue with {Java 7, default}.  I included the commands I used to start the
>> spark-shell but basically I just used all defaults (no alteration to driver
>> or executor memory) with the only additional call was with
>> driver-class-path to connect to MySQL Hive metastore.  This is on OSX
>> Macbook Pro.
>>
>>  One thing I did notice is that your version of Java 7 is version 51
>> while my version of Java 7 version 79.  Could you see if updating to Java 7
>> version 79 perhaps allows you to use the MaxPermSize call?
>>
>>
>>
>>
>>  On Mon, Jul 6, 2015 at 1:36 PM Simeon Simeonov  wrote:
>>
>>>  The file is at
>>> https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1
>>>
>>>  The command was included in the gist
>>>
>>>  SPARK_REPL_OPTS="-XX:MaxPermSize=256m"
>>> spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages
>>> com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g
>>>
>>>  /Sim
>>>
>>>  Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
>>> @simeons <http://twitter.com/simeons> | blog.simeonov.com | 617.299.6746
>>>
>>>
>>>   From: Yin Huai 
>>> Date: Monday, July 6, 2015 at 12:59 AM
>>> To: Simeon Simeonov 
>>> Cc: Denny Lee , Andy Huang <
>>> andy.hu...@servian.com.au>, user 
>>>
>>> Subject: Re: 1.4.0 regression: out-of-memory errors on small data
>>>
>>>   I have never seen issue like this. Setting PermGen size to 256m
>>> should solve the problem. Can you send me your test file and the command
>>> used to launch the spark shell or your application?
>>>
>>>  Thanks,
>>>
>>>  Yin
>>>
>>> On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov  wrote:
>>>
>>>>   Yin,
>>>>
>>>>  With 512Mb PermGen, the process still hung and had to be kill -9ed.
>>>>
>>>>  At 1Gb the spark shell & associated processes stopped hanging and
>>>> started exiting with
>>>>
>>>>  scala> println(dfCount.first.getLong(0))
>>>> 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040)
>>>> called with curMem=0, maxMem=2223023063
>>>> 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
>>>> values in memory (estimated size 229.5 KB, free 2.1 GB)
>>>> 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184)
>>>> called with curMem=235040, maxMem=2223023063
>>>> 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
>>>> stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
>>>> 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added
>>>> broadcast_2_piece0 in memory on localhost:65464 (size: 19.7 KB, free: 2.1
>>>> GB)
>>>> 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from
>>>> first at :30
>>>> java.lang.OutOfMemoryError: PermGen space
>>>> Stopping spark co

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
Hi Sim,

I think the right way to set the PermGen Size is through driver extra JVM
options, i.e.

--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m"

Can you try it? Without this conf, your driver's PermGen size is still 128m.
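
As an aside, one way to confirm the new setting took effect is to inspect the driver JVM's memory pools from the shell with the standard java.lang.management API (a sketch only; the exact pool name depends on the garbage collector in use):

import scala.collection.JavaConverters._
java.lang.management.ManagementFactory.getMemoryPoolMXBeans.asScala
  .filter(_.getName.contains("Perm"))
  .foreach(p => println(s"${p.getName}: max = ${p.getUsage.getMax >> 20} MB"))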

Thanks,

Yin

On Mon, Jul 6, 2015 at 4:07 AM, Denny Lee  wrote:

> I went ahead and tested your file and the results from the tests can be
> seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.
>
> Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default}
> the query ran without any issues.  I was able to recreate the issue with
> {Java 7, default}.  I included the commands I used to start the spark-shell
> but basically I just used all defaults (no alteration to driver or executor
> memory) with the only additional call was with driver-class-path to connect
> to MySQL Hive metastore.  This is on OSX Macbook Pro.
>
> One thing I did notice is that your version of Java 7 is version 51 while
> my version of Java 7 version 79.  Could you see if updating to Java 7
> version 79 perhaps allows you to use the MaxPermSize call?
>
>
>
>
> On Mon, Jul 6, 2015 at 1:36 PM Simeon Simeonov  wrote:
>
>>  The file is at
>> https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1
>>
>>  The command was included in the gist
>>
>>  SPARK_REPL_OPTS="-XX:MaxPermSize=256m"
>> spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages
>> com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g
>>
>>  /Sim
>>
>>  Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
>> @simeons <http://twitter.com/simeons> | blog.simeonov.com | 617.299.6746
>>
>>
>>   From: Yin Huai 
>> Date: Monday, July 6, 2015 at 12:59 AM
>> To: Simeon Simeonov 
>> Cc: Denny Lee , Andy Huang <
>> andy.hu...@servian.com.au>, user 
>>
>> Subject: Re: 1.4.0 regression: out-of-memory errors on small data
>>
>>   I have never seen issue like this. Setting PermGen size to 256m should
>> solve the problem. Can you send me your test file and the command used to
>> launch the spark shell or your application?
>>
>>  Thanks,
>>
>>  Yin
>>
>> On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov  wrote:
>>
>>>   Yin,
>>>
>>>  With 512Mb PermGen, the process still hung and had to be kill -9ed.
>>>
>>>  At 1Gb the spark shell & associated processes stopped hanging and
>>> started exiting with
>>>
>>>  scala> println(dfCount.first.getLong(0))
>>> 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040)
>>> called with curMem=0, maxMem=2223023063
>>> 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
>>> values in memory (estimated size 229.5 KB, free 2.1 GB)
>>> 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184)
>>> called with curMem=235040, maxMem=2223023063
>>> 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
>>> stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
>>> 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added
>>> broadcast_2_piece0 in memory on localhost:65464 (size: 19.7 KB, free: 2.1
>>> GB)
>>> 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from
>>> first at :30
>>> java.lang.OutOfMemoryError: PermGen space
>>> Stopping spark context.
>>> Exception in thread "main"
>>> Exception: java.lang.OutOfMemoryError thrown from the
>>> UncaughtExceptionHandler in thread "main"
>>> 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
>>> broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
>>> GB)
>>>
>>>  That did not change up until 4Gb of PermGen space and 8Gb for driver &
>>> executor each.
>>>
>>>  I stopped at this point because the exercise started looking silly. It
>>> is clear that 1.4.0 is using memory in a substantially different manner.
>>>
>>>  I'd be happy to share the test file so you can reproduce this in your
>>> own environment.
>>>
>>>  /Sim
>>>
>>>  Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
>>> @simeons <http://twitter.com/simeons> | blog.simeonov.com | 617.299.6746
>>>
>>>
>>>   From: Yin Huai 
>>> Date: Sunday, July 5, 2015 at 11:04 PM
>>> To: Denny Lee 
>>> Cc: Andy Huang , Simeon Simeonov <
>>> s...@

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
I have never seen an issue like this. Setting the PermGen size to 256m should
solve the problem. Can you send me your test file and the command used to
launch the spark shell or your application?

Thanks,

Yin

On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov  wrote:

>   Yin,
>
>  With 512Mb PermGen, the process still hung and had to be kill -9ed.
>
>  At 1Gb the spark shell & associated processes stopped hanging and
> started exiting with
>
>  scala> println(dfCount.first.getLong(0))
> 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040) called
> with curMem=0, maxMem=2223023063
> 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
> values in memory (estimated size 229.5 KB, free 2.1 GB)
> 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184) called
> with curMem=235040, maxMem=2223023063
> 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
> stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
> 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added broadcast_2_piece0
> in memory on localhost:65464 (size: 19.7 KB, free: 2.1 GB)
> 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from first
> at :30
> java.lang.OutOfMemoryError: PermGen space
> Stopping spark context.
> Exception in thread "main"
> Exception: java.lang.OutOfMemoryError thrown from the
> UncaughtExceptionHandler in thread "main"
> 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
> broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
> GB)
>
>  That did not change up until 4Gb of PermGen space and 8Gb for driver &
> executor each.
>
>  I stopped at this point because the exercise started looking silly. It
> is clear that 1.4.0 is using memory in a substantially different manner.
>
>  I'd be happy to share the test file so you can reproduce this in your
> own environment.
>
>  /Sim
>
>  Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/>
> @simeons <http://twitter.com/simeons> | blog.simeonov.com | 617.299.6746
>
>
>   From: Yin Huai 
> Date: Sunday, July 5, 2015 at 11:04 PM
> To: Denny Lee 
> Cc: Andy Huang , Simeon Simeonov ,
> user 
> Subject: Re: 1.4.0 regression: out-of-memory errors on small data
>
>   Sim,
>
>  Can you increase the PermGen size? Please let me know what is your
> setting when the problem disappears.
>
>  Thanks,
>
>  Yin
>
> On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee  wrote:
>
>>  I had run into the same problem where everything was working swimmingly
>> with Spark 1.3.1.  When I switched to Spark 1.4, either by upgrading to
>> Java8 (from Java7) or by knocking up the PermGenSize had solved my issue.
>> HTH!
>>
>>
>>
>>  On Mon, Jul 6, 2015 at 8:31 AM Andy Huang 
>> wrote:
>>
>>> We have hit the same issue in spark shell when registering a temp table.
>>> We observed it happening with those who had JDK 6. The problem went away
>>> after installing jdk 8. This was only for the tutorial materials which was
>>> about loading a parquet file.
>>>
>>>  Regards
>>> Andy
>>>
>>> On Sat, Jul 4, 2015 at 2:54 AM, sim  wrote:
>>>
>>>> @bipin, in my case the error happens immediately in a fresh shell in
>>>> 1.4.0.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
>>>>  Sent from the Apache Spark User List mailing list archive at
>>>> Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>>
>>>  --
>>>  Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
>>> f: 02 9376 0730| m: 0433221979
>>>
>>
>


Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
Sim,

Can you increase the PermGen size? Please let me know what is your setting
when the problem disappears.

Thanks,

Yin

On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee  wrote:

> I had run into the same problem where everything was working swimmingly
> with Spark 1.3.1.  When I switched to Spark 1.4, either by upgrading to
> Java8 (from Java7) or by knocking up the PermGenSize had solved my issue.
> HTH!
>
>
>
> On Mon, Jul 6, 2015 at 8:31 AM Andy Huang 
> wrote:
>
>> We have hit the same issue in spark shell when registering a temp table.
>> We observed it happening with those who had JDK 6. The problem went away
>> after installing jdk 8. This was only for the tutorial materials which was
>> about loading a parquet file.
>>
>> Regards
>> Andy
>>
>> On Sat, Jul 4, 2015 at 2:54 AM, sim  wrote:
>>
>>> @bipin, in my case the error happens immediately in a fresh shell in
>>> 1.4.0.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
>> f: 02 9376 0730| m: 0433221979
>>
>


Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim,

It seems you already set the PermGen size to 256m, right? I noticed that in
your shell session you created a HiveContext yourself (which further increases
the memory consumption on PermGen). But the spark shell has already created a
HiveContext for you (sqlContext; you can use asInstanceOf to access
HiveContext's methods). Can you just use the sqlContext created by the shell
and try again?
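
A minimal sketch of what that looks like (assuming a Hive-enabled build, where the shell's sqlContext is already a HiveContext underneath):

import org.apache.spark.sql.hive.HiveContext
val hiveCtx = sqlContext.asInstanceOf[HiveContext]
hiveCtx.sql("show tables").show()   // HiveContext features, without constructing a second context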

Thanks,

Yin

On Thu, Jul 2, 2015 at 12:50 PM, Yin Huai  wrote:

> Hi Sim,
>
> Spark 1.4.0's memory consumption on PermGen is higher than that of Spark 1.3
> (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you
> add --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" in the
> command you used to launch Spark shell? This will increase the PermGen size
> from 128m (our default) to 256m.
>
> Thanks,
>
> Yin
>
> On Thu, Jul 2, 2015 at 12:40 PM, sim  wrote:
>
>> A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1
>> and
>> fails with a series of out-of-memory errors in 1.4.0.
>>
>> This gist <https://gist.github.com/ssimeonov/a49b75dc086c3ac6f3c4>
>> includes the code and the full output from the 1.3.1 and 1.4.0 runs,
>> including the command line showing how spark-shell is started.
>>
>> Should the 1.4.0 spark-shell be started with different options to avoid
>> this
>> problem?
>>
>> Thanks,
>> Sim
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim,

Spark 1.4.0's memory consumption on PermGen is higher than that of Spark 1.3
(explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you
add --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" in the
command you used to launch Spark shell? This will increase the PermGen size
from 128m (our default) to 256m.

Thanks,

Yin

On Thu, Jul 2, 2015 at 12:40 PM, sim  wrote:

> A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1
> and
> fails with a series of out-of-memory errors in 1.4.0.
>
> This gist 
> includes the code and the full output from the 1.3.1 and 1.4.0 runs,
> including the command line showing how spark-shell is started.
>
> Should the 1.4.0 spark-shell be started with different options to avoid
> this
> problem?
>
> Thanks,
> Sim
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Yin Huai
Axel,

Can you file a JIRA and attach your code in its description? This looks like a
bug.

For the third row of df1, the name is "alice" instead of "carol", right?
Otherwise, "carol" should appear in the expected output.

Btw, to get rid of those columns with the same name after the join, you can
use select to pick columns you want to include in the results.

Thanks,

Yin

On Sat, Jun 27, 2015 at 11:29 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I would test it against 1.3 to be sure, because it could -- though
> unlikely -- be a regression. For example, I recently stumbled upon this
> issue  which was
> specific to 1.4.
>
> On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl  wrote:
>
>> I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's
>> code would be failing right now.
>>
>> On Saturday, June 27, 2015, Nicholas Chammas 
>> wrote:
>>
>>> Yeah, you shouldn't have to rename the columns before joining them.
>>>
>>> Do you see the same behavior on 1.3 vs 1.4?
>>>
>>> Nick
>>> On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl  wrote:
>>>
 still feels like a bug to have to create unique names before a join.

 On Fri, Jun 26, 2015 at 9:51 PM, ayan guha  wrote:

> You can declare the schema with unique names before creation of df.
> On 27 Jun 2015 13:01, "Axel Dahl"  wrote:
>
>>
>> I have the following code:
>>
>> from pyspark import SQLContext
>>
>> d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
>> 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 
>> 3}]
>> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
>> {'name':'alice', 'country': 'ire', 'colour':'green'}]
>>
>> r1 = sc.parallelize(d1)
>> r2 = sc.parallelize(d2)
>>
>> sqlContext = SQLContext(sc)
>> df1 = sqlContext.createDataFrame(d1)
>> df2 = sqlContext.createDataFrame(d2)
>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>> 'left_outer').collect()
>>
>>
>> When I run it I get the following, (notice in the first row, all join
>> keys are take from the right-side and so are blanked out):
>>
>> [Row(age=2, country=None, name=None, colour=None, country=None,
>> name=None),
>> Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>> country=u'usa', name=u'bob'),
>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>> country=u'ire', name=u'alice')]
>>
>> I would expect to get (though ideally without duplicate columns):
>> [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
>> name=None),
>> Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>> country=u'usa', name=u'bob'),
>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>> country=u'ire', name=u'alice')]
>>
>> The workaround for now is this rather clunky piece of code:
>> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
>> 'name2').withColumnRenamed('country', 'country2')
>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>> 'left_outer').collect()
>>
>> So to me it looks like a bug, but am I doing something wrong?
>>
>> Thanks,
>>
>> -Axel
>>
>>
>>
>>
>>



Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Yin Huai
The function accepted by explode is f: Row => TraversableOnce[A]. It seems
user_mentions is an array of structs, so can you change your pattern matching
to the following?

case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map(elem => ...)
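
For example, adapting the snippet from the question quoted below (a sketch only; it assumes the same imports and the same UserMention case class, reads indices as a Seq as well, and takes "is_str" in the original to be a typo for "id_str"):

val mentions = tweets
  .select("entities.user_mentions")
  .filter(!isEmpty($"user_mentions"))
  .explode($"user_mentions") {
    // the array column arrives as a Seq[Row], not an Array[Row]
    case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("id_str"),
        elem.getAs[Seq[Long]]("indices").toArray,
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }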

On Wed, Jun 24, 2015 at 5:27 AM, Gustavo Arjones 
wrote:

> Hi All,
>
> I am using the new *Apache Spark version 1.4.0 Data-frames API* to
> extract information from Twitter's Status JSON, mostly focused on the Entities
> Object  - the relevant
> part to this question is showed below:
>
> {
>   ...
>   ...
>   "entities": {
> "hashtags": [],
> "trends": [],
> "urls": [],
> "user_mentions": [
>   {
> "screen_name": "linobocchini",
> "name": "Lino Bocchini",
> "id": 187356243,
> "id_str": "187356243",
> "indices": [ 3, 16 ]
>   },
>   {
> "screen_name": "jeanwyllys_real",
> "name": "Jean Wyllys",
> "id": 23176,
> "id_str": "23176",
> "indices": [ 79, 95 ]
>   }
> ],
> "symbols": []
>   },
>   ...
>   ...
> }
>
> There are several examples on how extract information from primitives
> types as string, integer, etc - but I couldn't find anything on how to
> process those kind of *complex* structures.
>
> I tried the code below, but it still doesn't work; it throws an exception:
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> val tweets = sqlContext.read.json("tweets.json")
>
> // this function is just to filter empty entities.user_mentions[] nodes
> // some tweets doesn't contains any mentions
> import org.apache.spark.sql.functions.udf
> val isEmpty = udf((value: List[Any]) => value.isEmpty)
>
> import org.apache.spark.sql._
> import sqlContext.implicits._
> case class UserMention(id: Long, idStr: String, indices: Array[Long], name: 
> String, screenName: String)
>
> val mentions = tweets.select("entities.user_mentions").
>   filter(!isEmpty($"user_mentions")).
>   explode($"user_mentions") {
>   case Row(arr: Array[Row]) => arr.map { elem =>
> UserMention(
>   elem.getAs[Long]("id"),
>   elem.getAs[String]("is_str"),
>   elem.getAs[Array[Long]]("indices"),
>   elem.getAs[String]("name"),
>   elem.getAs[String]("screen_name"))
>   }
> }
>
> mentions.first
>
> Exception when I try to call mentions.first:
>
> scala> mentions.first
> 15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
> scala.MatchError: [List([187356243,187356243,List(3, 16),Lino 
> Bocchini,linobocchini], [23176,23176,List(79, 95),Jean 
> Wyllys,jeanwyllys_real])] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
> at 
> $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34)
> at 
> $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34)
> at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
> at 
> org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)
>
> What is wrong here? I understand it is related to the types but I couldn't
> figure out it yet.
>
> As additional context, the structure mapped automatically is:
>
> scala> mentions.printSchema
> root
>  |-- user_mentions: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- id: long (nullable = true)
>  |||-- id_str: string (nullable = true)
>  |||-- indices: array (nullable = true)
>  ||||-- element: long (containsNull = true)
>  |||-- name: string (nullable = true)
>  |||-- screen_name: string (nullable = true)
>
> *NOTE 1:* I know it is possible to solve this using HiveQL but I would
> like to use Data-frames once there is so much momentum around it.
>
> SELECT explode(entities.user_mentions) as mentions
> FROM tweets
>
> *NOTE 2:* the *UDF* val isEmpty = udf((value: List[Any]) => value.isEmpty) is
> a ugly hack and I'm missing something here, but was the only way I came up
> to avoid a NPE
>
> I’ve posted the same question on SO:
> http://stackoverflow.com/questions/31016156/how-to-extract-complex-json-structures-using-apache-spark-1-4-0-data-frames
>
> Thanks all!
> - gustavo
>
>


Re: Help optimising Spark SQL query

2015-06-22 Thread Yin Huai
Hi James,

Maybe it's the DISTINCT causing the issue.

I rewrote the query as follows. Maybe this one can finish faster.

select
  sum(cnt) as uses,
  count(id) as users
from (
  select
count(*) cnt,
cast(id as string) as id
  from usage_events
  where
from_unixtime(cast(timestamp_millis/1000 as bigint)) between
'2015-06-09' and '2015-06-16'
  group by cast(id as string)
) tmp

Thanks,

Yin

On Mon, Jun 22, 2015 at 12:55 PM, Jörn Franke  wrote:

> Generally (not only spark sql specific) you should not cast in the where
> part of a sql query. It is also not necessary in your case. Getting rid of
> casts in the whole query will be also beneficial.
>
> Le lun. 22 juin 2015 à 17:29, James Aley  a
> écrit :
>
>> Hello,
>>
>> A colleague of mine ran the following Spark SQL query:
>>
>> select
>>   count(*) as uses,
>>   count (distinct cast(id as string)) as users
>> from usage_events
>> where
>>   from_unixtime(cast(timestamp_millis/1000 as bigint))
>> between '2015-06-09' and '2015-06-16'
>>
>> The table contains billions of rows, but totals only 64GB of data across
>> ~30 separate files, which are stored as Parquet with LZO compression in S3.
>>
>> From the referenced columns:
>>
>> * id is Binary, which we cast to a String so that we can DISTINCT by it.
>> (I was already told this will improve in a later release, in a separate
>> thread.)
>> * timestamp_millis is a long, containing a unix timestamp with
>> millisecond resolution
>>
>> This took nearly 2 hours to run on a 5 node cluster of r3.xlarge EC2
>> instances, using 20 executors, each with 4GB memory. I can see from
>> monitoring tools that the CPU usage is at 100% on all nodes, but incoming
>> network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound.
>>
>> Does that seem slow? Can anyone offer any ideas by glancing at the query
>> as to why this might be slow? We'll profile it meanwhile and post back if
>> we find anything ourselves.
>>
>> A side issue - I've found that this query, and others, sometimes
>> completes but doesn't return any results. There appears to be no error that
>> I can see in the logs, and Spark reports the job as successful, but the
>> connected JDBC client (SQLWorkbenchJ in this case), just sits there forever
>> waiting. I did a quick Google and couldn't find anyone else having similar
>> issues.
>>
>>
>> Many thanks,
>>
>> James.
>>
>


Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
If you are writing to an existing Hive table, our insert into operator follows
Hive's requirement: the dynamic partition columns must be specified last among
the columns in the SELECT statement, and in the same order in which they appear
in the PARTITION() clause.

You can find this requirement at
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions.

If you use select to reorder the columns, I think it should work. Also, since
the table is an existing Hive table, you do not need to specify the format
because we will use the format of the existing table.
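
One way to apply that suggestion (a sketch only, untested, reusing the table and column names from the snippet quoted below):

import org.apache.spark.sql.functions.col

val partCols = Seq("zone", "z", "year", "month")
val dataCols = partitionedTestDF2.columns.filterNot(partCols.contains)
// Put the dynamic partition columns last, in the same order as in partitionBy(...).
val reordered = partitionedTestDF2.select((dataCols ++ partCols).map(col): _*)

reordered.write
  .mode(org.apache.spark.sql.SaveMode.Append)
  .partitionBy("zone", "z", "year", "month")
  .saveAsTable("test4DimBySpark")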

btw, please feel free to open a jira about removing this requirement for
inserting into an existing hive table.

Thanks,

Yin

On Thu, Jun 18, 2015 at 9:39 PM, Yin Huai  wrote:

> Are you writing to an existing hive orc table?
>
> On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian  wrote:
>
>> Thanks for reporting this. Would you mind to help creating a JIRA for
>> this?
>>
>>
>> On 6/16/15 2:25 AM, patcharee wrote:
>>
>>> I found if I move the partitioned columns in schemaString and in Row to
>>> the end of the sequence, then it works correctly...
>>>
>>> On 16. juni 2015 11:14, patcharee wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using spark 1.4 and HiveContext to append data into a partitioned
>>>> hive table. I found that the data insert into the table is correct, but the
>>>> partition(folder) created is totally wrong.
>>>> Below is my code snippet>>
>>>>
>>>> ---
>>>>
>>>> val schemaString = "zone z year month date hh x y height u v w ph phb t
>>>> p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
>>>> val schema =
>>>>   StructType(
>>>> schemaString.split(" ").map(fieldName =>
>>>>   if (fieldName.equals("zone") || fieldName.equals("z") ||
>>>> fieldName.equals("year") || fieldName.equals("month") ||
>>>>   fieldName.equals("date") || fieldName.equals("hh") ||
>>>> fieldName.equals("x") || fieldName.equals("y"))
>>>> StructField(fieldName, IntegerType, true)
>>>>   else
>>>> StructField(fieldName, FloatType, true)
>>>> ))
>>>>
>>>> val pairVarRDD =
>>>> sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
>>>>
>>>> 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),
>>>>
>>>> 0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
>>>> ))
>>>>
>>>> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
>>>>
>>>> partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
>>>>
>>>> .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark")
>>>>
>>>>
>>>> ---
>>>>
>>>>
>>>> The table contains 23 columns (longer than Tuple maximum length), so I
>>>> use Row Object to store raw data, not Tuple.
>>>> Here is some message from spark when it saved data>>
>>>>
>>>> 15/06/16 10:39:22 INFO metadata.Hive: Renaming
>>>> src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest:
>>>> hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true
>>>>
>>>> 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
>>>> hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0
>>>> with partSpec {zone=13195, z=0, year=0, month=0}
>>>>
>>>> From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month =

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
Are you writing to an existing hive orc table?

On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian  wrote:

> Thanks for reporting this. Would you mind to help creating a JIRA for this?
>
>
> On 6/16/15 2:25 AM, patcharee wrote:
>
>> I found if I move the partitioned columns in schemaString and in Row to
>> the end of the sequence, then it works correctly...
>>
>> On 16. juni 2015 11:14, patcharee wrote:
>>
>>> Hi,
>>>
>>> I am using spark 1.4 and HiveContext to append data into a partitioned
>>> hive table. I found that the data insert into the table is correct, but the
>>> partition(folder) created is totally wrong.
>>> Below is my code snippet>>
>>>
>>> ---
>>>
>>> val schemaString = "zone z year month date hh x y height u v w ph phb t
>>> p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
>>> val schema =
>>>   StructType(
>>> schemaString.split(" ").map(fieldName =>
>>>   if (fieldName.equals("zone") || fieldName.equals("z") ||
>>> fieldName.equals("year") || fieldName.equals("month") ||
>>>   fieldName.equals("date") || fieldName.equals("hh") ||
>>> fieldName.equals("x") || fieldName.equals("y"))
>>> StructField(fieldName, IntegerType, true)
>>>   else
>>> StructField(fieldName, FloatType, true)
>>> ))
>>>
>>> val pairVarRDD =
>>> sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
>>>
>>> 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),
>>>
>>> 0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
>>> ))
>>>
>>> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
>>>
>>> partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
>>>
>>> .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark")
>>>
>>>
>>> ---
>>>
>>>
>>> The table contains 23 columns (longer than Tuple maximum length), so I
>>> use Row Object to store raw data, not Tuple.
>>> Here is some message from spark when it saved data>>
>>>
>>> 15/06/16 10:39:22 INFO metadata.Hive: Renaming
>>> src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest:
>>> hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true
>>>
>>> 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
>>> hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0
>>> with partSpec {zone=13195, z=0, year=0, month=0}
>>>
>>> From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = 3.
>>> But spark created a partition {zone=13195, z=0, year=0, month=0}.
>>>
>>> When I queried from hive>>
>>>
>>> hive> select * from test4dimBySpark;
>>> OK
>>> 242200931.00.0218.0365.09989.497
>>> 29.62711319.0717930.11982734-3174.681297735.2 16.389032
>>> -96.6289125135.3652.6476808E-50.0 13195000
>>> hive> select zone, z, year, month from test4dimBySpark;
>>> OK
>>> 13195000
>>> hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
>>> Found 2 items
>>> -rw-r--r--   3 patcharee hdfs   1411 2015-06-16 10:39
>>> /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1
>>>
>>> The data stored in the table is correct zone = 2, z = 42, year = 2009,
>>> month = 3, but the partition created was wrong
>>> "zone=13195/z=0/year=0/month=0"
>>>
>>> Is this a bug or what could be wrong? Any suggestion is appreciated.
>>>
>>> BR,
>>> Patcharee
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
Btw, the user list will be a better place for this thread.

On Thu, Jun 18, 2015 at 8:19 AM, Yin Huai  wrote:

> Is it the full stack trace?
>
> On Thu, Jun 18, 2015 at 6:39 AM, Sea <261810...@qq.com> wrote:
>
>> Hi, all:
>>
>> I want to run spark sql on yarn(yarn-client), but ... I already set
>> "spark.yarn.jar" and  "spark.jars" in conf/spark-defaults.conf.
>>
>> ./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 > 
>> game.txt
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/deploy/yarn/ExecutorLauncher
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.deploy.yarn.ExecutorLauncher
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>> Could not find the main class:
>> org.apache.spark.deploy.yarn.ExecutorLauncher.  Program will exit.
>>
>>
>> Anyone can help?
>>
>
>


Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Yin Huai
So, the second attempt of those tasks that failed with the NPE could complete,
and the job eventually finished?

On Mon, Jun 15, 2015 at 10:37 PM, Night Wolf  wrote:

> Hey Yin,
>
> Thanks for the link to the JIRA. I'll add details to it. But I'm able to
> reproduce it, at least in the same shell session, every time I do a write I
> get a random number of tasks failing on the first run with the NPE.
>
> Using dynamic allocation of executors in YARN mode. No speculative
> execution is enabled.
>
> On Tue, Jun 16, 2015 at 3:11 PM, Yin Huai  wrote:
>
>> I saw it once but I was not clear how to reproduce it. The jira I created
>> is https://issues.apache.org/jira/browse/SPARK-7837.
>>
>> More information will be very helpful. Were those errors from speculative
>> tasks or regular tasks (the first attempt of the task)? Is this error
>> deterministic (can you reproduce every time you run this command)?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Jun 15, 2015 at 8:59 PM, Night Wolf 
>> wrote:
>>
>>> Looking at the logs of the executor, looks like it fails to find the
>>> file; e.g. for task 10323.0
>>>
>>>
>>> 15/06/16 13:43:13 ERROR output.FileOutputCommitter: Hit IOException
>>> trying to rename
>>> maprfs:///user/hive/warehouse/is_20150617_test2/_temporary/_attempt_201506161340__m_010181_0/part-r-353626.gz.parquet
>>> to maprfs:/user/hive/warehouse/is_20150617_test2/part-r-353626.gz.parquet
>>> java.io.IOException: Invalid source or target
>>> at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
>>> at
>>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
>>> at
>>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
>>> at
>>> org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
>>> at
>>> org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
>>> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
>>> $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
>>> at
>>> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
>>> at
>>> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>> 15/06/16 13:43:13 ERROR mapred.SparkHadoopMapRedUtil: Error committing
>>> the output of task: attempt_201506161340__m_010181_0
>>> java.io.IOException: Invalid source or target
>>> at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
>>> at
>>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
>>> at
>>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
>>> at
>>> org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
>>> at
>>> org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
>>> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
>>> $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
>>> at
>>> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
>&g

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Yin Huai
Hi John,

Did you also set spark.sql.planner.externalSort to true? You probably will not
see executors lost with this conf. For now, maybe you can manually split the
query into two parts, one for the skewed keys and one for the other records.
Then, you can union the results of these two parts together.
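
A purely illustrative sketch of that manual split (big_table, small_table, join_key, and the hot key values are placeholders, not names from your actual job):

val skewed = sqlContext.sql(
  "SELECT b.*, s.* FROM big_table b JOIN small_table s ON b.join_key = s.join_key " +
  "WHERE b.join_key IN ('hotKey1', 'hotKey2')")
val rest = sqlContext.sql(
  "SELECT b.*, s.* FROM big_table b JOIN small_table s ON b.join_key = s.join_key " +
  "WHERE b.join_key NOT IN ('hotKey1', 'hotKey2')")
val combined = skewed.unionAll(rest)   // both parts have the same schema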

Thanks,

Yin

On Wed, Jun 17, 2015 at 9:53 AM, Koert Kuipers  wrote:

> could it be composed maybe? a general version and then a sql version that
> exploits the additional info/abilities available there and uses the general
> version internally...
>
> i assume the sql version can benefit from the logical phase optimization
> to pick join details. or is there more?
>
> On Tue, Jun 16, 2015 at 7:37 PM, Michael Armbrust 
> wrote:
>
>> this would be a great addition to spark, and ideally it belongs in spark
>>> core not sql.
>>>
>>
>> I agree with the fact that this would be a great addition, but we would
>> likely want a specialized SQL implementation for performance reasons.
>>
>
>


Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Yin Huai
I saw it once, but I was not sure how to reproduce it. The JIRA I created is
https://issues.apache.org/jira/browse/SPARK-7837.

More information will be very helpful. Were those errors from speculative
tasks or regular tasks (the first attempt of the task)? Is this error
deterministic (can you reproduce every time you run this command)?

Thanks,

Yin

On Mon, Jun 15, 2015 at 8:59 PM, Night Wolf  wrote:

> Looking at the logs of the executor, looks like it fails to find the file;
> e.g. for task 10323.0
>
>
> 15/06/16 13:43:13 ERROR output.FileOutputCommitter: Hit IOException trying
> to rename
> maprfs:///user/hive/warehouse/is_20150617_test2/_temporary/_attempt_201506161340__m_010181_0/part-r-353626.gz.parquet
> to maprfs:/user/hive/warehouse/is_20150617_test2/part-r-353626.gz.parquet
> java.io.IOException: Invalid source or target
> at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
> at
> org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
> at
> org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
> $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
> at
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
> at
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 15/06/16 13:43:13 ERROR mapred.SparkHadoopMapRedUtil: Error committing the
> output of task: attempt_201506161340__m_010181_0
> java.io.IOException: Invalid source or target
> at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
> at
> org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
> at
> org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
> $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
> at
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
> at
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 15/06/16 13:43:16 ERROR output.FileOutputCommitter: Hit IOException trying
> to rename
> maprfs:///user/hive/warehouse/is_20150617_test2/_temporary/_attempt_201506161341__m_010323_0/part-r-353768.gz.parquet
> to maprfs:/user/hive/warehouse/is_20150617_test2/part-r-353768.gz.parquet
> java.io.IOException: Invalid source or target
> at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
> at
> org.apac

Re: Issues with `when` in Column class

2015-06-12 Thread Yin Huai
Hi Chris,

Have you imported "org.apache.spark.sql.functions._"?
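
A minimal sketch of the intended usage (df and someCol are placeholder names):

import org.apache.spark.sql.functions.{col, when}
// when() comes from org.apache.spark.sql.functions, not from Column itself
val flagged = df.withColumn("flag", when(col("someCol") === "abc", 1).otherwise(0))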

Thanks,

Yin

On Fri, Jun 12, 2015 at 8:05 AM, Chris Freeman  wrote:

>  I’m trying to iterate through a list of Columns and create new Columns
> based on a condition. However, the when method keeps giving me errors that
> don’t quite make sense.
>
>  If I do `when(col === “abc”, 1).otherwise(0)` I get the following error
> at compile time:
>
>  [error] not found: value when
>
>  However, this works in the REPL just fine after I import
> org.apache.spark.sql.Column.
>
>  On the other hand, if I do `col.when(col === “abc”, 1).otherwise(0)`, it
> will compile successfully, but then at runtime, I get this error:
>
>  java.lang.IllegalArgumentException: when() can only be applied on a
> Column previously generated by when() function
>
>  This appears to be pretty circular logic. How can `when` only be applied
> to a Column previously generated by `when`? How would I call `when` in the
> first place?
>


Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-05 Thread Yin Huai
Hi Doug,

For now, I think you can use "sqlContext.sql("USE databaseName")" to change
the current database.
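
For example, using the db.tbl names from the thread (a sketch):

sqlContext.sql("USE db")
val df = sqlContext.table("tbl")   // now resolved against the current database, i.e. db.tbl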

Thanks,

Yin

On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai  wrote:

> Hi Doug,
>
> sqlContext.table does not officially support database name. It only
> supports table name as the parameter. We will add a method to support
> database name in future.
>
> Thanks,
>
> Yin
>
> On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog 
> wrote:
>
>> Hi Yin,
>>  I’m very surprised to hear that its not supported in 1.3 because I’ve
>> been using it since 1.3.0.
>> It worked great up until  SPARK-6908 was merged into master.
>>
>> What is the supported way to get  DF for a table that is not in the
>> default database ?
>>
>> IMHO, If you are not going to support “databaseName.tableName”,
>> sqlContext.table() should have a version that takes a database and a table,
>> ie
>>
>> def table(databaseName: String, tableName: String): DataFrame =
>>   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))
>>
>> The handling of databases in Spark(sqlContext, hiveContext, Catalog)
>> could be better.
>>
>> Thanks,
>>
>> Doug
>>
>> > On Jun 3, 2015, at 8:21 PM, Yin Huai  wrote:
>> >
>> > Hi Doug,
>> >
>> > Actually, sqlContext.table does not support database name in both Spark
>> 1.3 and Spark 1.4. We will support it in future version.
>> >
>> > Thanks,
>> >
>> > Yin
>> >
>> >
>> >
>> > On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog 
>> wrote:
>> > Hi,
>> >
>> > sqlContext.table(“db.tbl”) isn’t working for me, I get a
>> NoSuchTableException.
>> >
>> > But I can access the table via
>> >
>> > sqlContext.sql(“select * from db.tbl”)
>> >
>> > So I know it has the table info from the metastore.
>> >
>> > Anyone else see this ?
>> >
>> > I’ll keep digging.
>> > I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive
>> -Phive-thriftserver
>> > It worked for me in 1.3.1
>> >
>> > Cheers,
>> >
>> > Doug
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>> >
>>
>>
>


Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-04 Thread Yin Huai
Hi Doug,

sqlContext.table does not officially support a database name; it only takes a
table name as its parameter. We will add a method that supports a database name
in the future.

Thanks,

Yin

On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog  wrote:

> Hi Yin,
>  I’m very surprised to hear that its not supported in 1.3 because I’ve
> been using it since 1.3.0.
> It worked great up until  SPARK-6908 was merged into master.
>
> What is the supported way to get  DF for a table that is not in the
> default database ?
>
> IMHO, If you are not going to support “databaseName.tableName”,
> sqlContext.table() should have a version that takes a database and a table,
> ie
>
> def table(databaseName: String, tableName: String): DataFrame =
>   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))
>
> The handling of databases in Spark(sqlContext, hiveContext, Catalog) could
> be better.
>
> Thanks,
>
> Doug
>
> > On Jun 3, 2015, at 8:21 PM, Yin Huai  wrote:
> >
> > Hi Doug,
> >
> > Actually, sqlContext.table does not support database name in both Spark
> 1.3 and Spark 1.4. We will support it in future version.
> >
> > Thanks,
> >
> > Yin
> >
> >
> >
> > On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog 
> wrote:
> > Hi,
> >
> > sqlContext.table(“db.tbl”) isn’t working for me, I get a
> NoSuchTableException.
> >
> > But I can access the table via
> >
> > sqlContext.sql(“select * from db.tbl”)
> >
> > So I know it has the table info from the metastore.
> >
> > Anyone else see this ?
> >
> > I’ll keep digging.
> > I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive
> -Phive-thriftserver
> > It worked for me in 1.3.1
> >
> > Cheers,
> >
> > Doug
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>
>


Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-04 Thread Yin Huai
Are you using RC4?

On Wed, Jun 3, 2015 at 10:58 PM, Night Wolf  wrote:

> Thanks Yin, that seems to work with the Shell. But on a compiled
> application with Spark-submit it still fails with the same exception.
>
> On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai  wrote:
>
>> Can you put the following setting in spark-defaults.conf and try again?
>>
>> spark.sql.hive.metastore.sharedPrefixes
>> com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
>>
>> https://issues.apache.org/jira/browse/SPARK-7819 has more context about
>> it.
>>
>> On Wed, Jun 3, 2015 at 9:38 PM, Night Wolf 
>> wrote:
>>
>>> Hi all,
>>>
>>> Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client
>>> mode with Hive support.
>>>
>>> *Build command;*
>>> ./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn
>>> -Pmapr4 -Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided
>>> -Dhadoop.version=2.5.1-mapr-1501 -Dyarn.version=2.5.1-mapr-1501 -DskipTests
>>> -e -X
>>>
>>>
>>> When trying to run a hive query in the spark shell *sqlContext.sql("show
>>> tables")* I get the following exception;
>>>
>>> scala> sqlContext.sql("show tables")
>>> 15/06/04 04:33:16 INFO hive.HiveContext: Initializing
>>> HiveMetastoreConnection version 0.13.1 using Spark classes.
>>> java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at com.mapr.fs.ShimLoader.loadNativeLibrary(ShimLoader.java:323)
>>> at com.mapr.fs.ShimLoader.load(ShimLoader.java:198)
>>> at
>>> org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:59)
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:274)
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1857)
>>> at
>>> org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2072)
>>> at
>>> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2282)
>>> at
>>> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2234)
>>> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2151)
>>> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1002)
>>> at org.apache.hadoop.conf.Configuration.set(Configuration.java:974)
>>> at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
>>> at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)
>>> at org.apache.hadoop.mapred.JobConf.(JobConf.java:430)
>>> at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:1366)
>>> at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:1332)
>>> at
>>> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:99)
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:170)
>>> at
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:166)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:212)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
>>> at
>>> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:370)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:370)
>>> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:369)
>>> at
>>> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:382)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:382)
>>> at org.apache.spark.sql.hive.HiveContext.analyzer(HiveCont

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-03 Thread Yin Huai
Can you put the following setting in spark-defaults.conf and try again?

spark.sql.hive.metastore.sharedPrefixes
com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni

https://issues.apache.org/jira/browse/SPARK-7819 has more context about it.
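
If editing spark-defaults.conf is awkward for the compiled application, a sketch of an
alternative is to put the same property on the SparkConf before the HiveContext is
created (passing it with spark-submit --conf should be equivalent). This assumes that
spark.sql.* entries on the SparkConf are propagated to the SQL context; the app name is
only a placeholder.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("hive-on-mapr")  // placeholder name
  .set("spark.sql.hive.metastore.sharedPrefixes",
    "com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc," +
      "com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)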

On Wed, Jun 3, 2015 at 9:38 PM, Night Wolf  wrote:

> Hi all,
>
> Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client mode 
> with
> Hive support.
>
> *Build command;*
> ./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn -Pmapr4
> -Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided
> -Dhadoop.version=2.5.1-mapr-1501 -Dyarn.version=2.5.1-mapr-1501 -DskipTests
> -e -X
>
>
> When trying to run a hive query in the spark shell *sqlContext.sql("show
> tables")* I get the following exception;
>
> scala> sqlContext.sql("show tables")
> 15/06/04 04:33:16 INFO hive.HiveContext: Initializing
> HiveMetastoreConnection version 0.13.1 using Spark classes.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.mapr.fs.ShimLoader.loadNativeLibrary(ShimLoader.java:323)
> at com.mapr.fs.ShimLoader.load(ShimLoader.java:198)
> at
> org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:59)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1857)
> at
> org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2072)
> at
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2282)
> at
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2234)
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2151)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1002)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:974)
> at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
> at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:430)
> at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:1366)
> at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:1332)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:99)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:170)
> at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:166)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:212)
> at
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
> at
> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:370)
> at
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:370)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:369)
> at
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:382)
> at
> org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:382)
> at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:381)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:131)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
> at
> $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:21)
> at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
> at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
> at $line37.$read$$iwC$$iwC$$iwC$$iwC.(:32)
> at $line37.$read$$iwC$$iwC$$iwC.(:34)
> at $line37.$read$$iwC$$iwC.(:36)
> at $line37.$read$$iwC.(:38)
> at $line37.$read.(:40)
> at $line37.$read$.(:44)
> at $line37.$read$.()
> at $line37.$eval$.(:7)
> at $line37.$eval$.()
> at $line37.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)

Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-03 Thread Yin Huai
Hi Doug,

Actually, sqlContext.table does not support database names in either Spark
1.3 or Spark 1.4. We will support this in a future version.
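
In the meantime, a workaround sketch (Doug also notes it below) is to qualify the
database inside a SQL statement instead of passing "db.tbl" to table():

val df = sqlContext.sql("SELECT * FROM db.tbl")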

Thanks,

Yin



On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog 
wrote:

> Hi,
>
> sqlContext.table(“db.tbl”) isn’t working for me, I get a
> NoSuchTableException.
>
> But I can access the table via
>
> sqlContext.sql(“select * from db.tbl”)
>
> So I know it has the table info from the metastore.
>
> Anyone else see this ?
>
> I’ll keep digging.
> I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive
> -Phive-thriftserver
> It worked for me in 1.3.1
>
> Cheers,
>
> Doug
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Yin Huai
Does it happen every time you read a parquet source?

On Tue, Jun 2, 2015 at 3:42 AM, Anders Arpteg  wrote:

> The log is from the log aggregation tool (hortonworks, "yarn logs ..."),
> so both executors and driver. I'll send a private mail to you with the full
> logs. Also, tried another job as you suggested, and it actually worked
> fine. The first job was reading from a parquet source, and the second from
> an avro source. Could there be some issues with the parquet reader?
>
> Thanks,
> Anders
>
> On Tue, Jun 2, 2015 at 11:53 AM, Shixiong Zhu  wrote:
>
>> How about other jobs? Is it an executor log, or a driver log? Could you
>> post other logs near this error, please? Thank you.
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-06-02 17:11 GMT+08:00 Anders Arpteg :
>>
>>> Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that
>>> worked fine for Spark 1.3. The job starts on the cluster (yarn-cluster
>>> mode), initial stage starts, but the job fails before any task succeeds
>>> with the following error. Any hints?
>>>
>>> [ERROR] [06/02/2015 09:05:36.962] [Executor task launch worker-0]
>>> [akka.tcp://sparkDriver@10.254.6.15:33986/user/CoarseGrainedScheduler]
>>> swallowing exception during message send
>>> (akka.remote.RemoteTransportExceptionNoStackTrace)
>>> Exception in thread "main" akka.actor.ActorNotFound: Actor not found
>>> for: ActorSelection[Anchor(akka.tcp://sparkDriver@10.254.6.15:33986/),
>>> Path(/user/OutputCommitCoordinator)]
>>> at
>>> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
>>> at
>>> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>>> at
>>> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
>>> at
>>> akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
>>> at
>>> akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
>>> at
>>> akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
>>> at
>>> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>>> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
>>> at
>>> akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
>>> at
>>> akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
>>> at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
>>> at
>>> akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
>>> at
>>> akka.remote.EndpointManager$$anonfun$1.applyOrElse(Remoting.scala:575)
>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>> at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>> at
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>
>>>
>>
>


Re: Spark 1.3.0 -> 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-30 Thread Yin Huai
Looks like your program somehow picked up an older version of Parquet (Spark
1.3.1 uses Parquet 1.6.0rc3, and it seems the NO_FILTER field was introduced
in 1.6.0rc2). Is it possible for you to check the Parquet lib version on your
classpath?
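
One quick way to see which Parquet jar wins on the classpath, as a sketch
(ParquetMetadataConverter is named here because it is the parquet-mr 1.6.x class that
carries NO_FILTER; run this in the environment that throws the error):

println(classOf[parquet.format.converter.ParquetMetadataConverter]
  .getProtectionDomain.getCodeSource.getLocation)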

Thanks,

Yin

On Sat, May 30, 2015 at 2:44 PM, ogoh  wrote:

>
> I had the same issue on AWS EMR with Spark 1.3.1.e (AWS version) passed
> with
> '-h' parameter (it is bootstrap action parameter for spark).
> I don't see the problem with Spark 1.3.1.e not passing the parameter.
> I am not sure about your env.
> Thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-0-1-3-1-produces-java-lang-NoSuchFieldError-NO-FILTER-tp22897p23090.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: dataframe cumulative sum

2015-05-29 Thread Yin Huai
Hi Cesar,

We just added it in Spark 1.4.

In Spark 1.4, you can use a window function in HiveContext to do it. Assuming
you want to calculate the cumulative sum for every flag:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df.select(
  $"flag",
  $"price",
  sum($"price").over(
    Window.partitionBy("flag").orderBy("price").rowsBetween(Long.MinValue, 0)))


In the code, over lets Spark SQL know that you want to use sum as a window
function. partitionBy("flag") partitions the table by the value of flag, and
the sum's scope is a single partition. orderBy("price") sorts the rows within
a partition by the value of price (this probably does not really matter for
your case, but using orderBy makes the result deterministic). Finally,
rowsBetween(Long.MinValue, 0) means that the sum for every row is calculated
from the price values of the first row in the partition through the current
row (so you get the cumulative sum).
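
Putting it together, here is a self-contained sketch with the sample rows from your
message (it assumes a Spark 1.4 build with Hive support, where sqlContext is a
HiveContext, e.g. in the spark-shell):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = sqlContext.createDataFrame(Seq(
  (1, 47.808764653746),
  (1, 47.808764653746),
  (1, 31.9869279512204)
)).toDF("flag", "price")

// Cumulative sum per flag: from the first row of the partition up to the current row.
val cumSumWindow =
  Window.partitionBy("flag").orderBy("price").rowsBetween(Long.MinValue, 0)

df.select($"flag", $"price", sum($"price").over(cumSumWindow).as("cumsum_price")).show()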

Thanks,

Yin

On Fri, May 29, 2015 at 8:09 AM, Cesar Flores  wrote:

> What will be the more appropriate method to add a cumulative sum column to
> a data frame. For example, assuming that I have the next data frame:
>
> flag | price
> --
> 1|47.808764653746
> 1|47.808764653746
> 1|31.9869279512204
>
>
> How can I create a data frame with an extra cumsum column as the next one:
>
> flag | price  | cumsum_price
> --|---
> 1|47.808764653746 | 47.808764653746
> 1|47.808764653746 | 95.6175293075
> 1|31.9869279512204| 127.604457259
>
>
> Thanks
> --
> Cesar Flores
>


Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread Yin Huai
Can you attach the full stack trace?

Thanks,

Yin

On Fri, May 8, 2015 at 4:44 PM, barmaley  wrote:

> Given a registered table from data frame, I'm able to execute queries like
> sqlContext.sql("SELECT STDDEV(col1) FROM table") from Spark Shell just
> fine.
> However, when I run exactly the same code in a standalone app on a cluster,
> it throws an exception: "java.util.NoSuchElementException: key not found:
> STDDEV"...
>
> Is STDDEV ia among default functions in Spark SQL? I'd appreciate if you
> could comment what's going on with the above.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-STDDEV-working-in-Spark-Shell-but-not-in-a-standalone-app-tp22825.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
Oh, I missed that. It is fixed in 1.3.0.

Also, Jianshi, the dataset was not generated by Spark SQL, right?

On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu  wrote:

> Yin:
> Fix Version of SPARK-4520 is not set.
> I assume it was fixed in 1.3.0
>
> Cheers
> Fix Version
>
> On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai  wrote:
>
>> The exception looks like the one mentioned in
>> https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
>> Spark?
>>
>> On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang 
>> wrote:
>>
>>> Hi,
>>>
>>> My data looks like this:
>>>
>>> +---++--+
>>> | col_name  | data_type  | comment  |
>>> +---++--+
>>> | cust_id   | string |  |
>>> | part_num  | int|  |
>>> | ip_list   | array>   |  |
>>> | vid_list  | array>  |  |
>>> | fso_list  | array>  |  |
>>> | src   | string |  |
>>> | date  | int|  |
>>> +---++--+
>>>
>>> And I did select *, it reports ParquetDecodingException.
>>>
>>> Is this type not supported in SparkSQL?
>>>
>>> Detailed error message here:
>>>
>>>
>>> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
>>> Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in 
>>> stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com): 
>>> parquet.io.ParquetDecodingException:
>>> Can not read value at 0 in block -1 in file 
>>> hdfs://xxx/part-m-0.gz.parquet
>>> at 
>>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>>> at 
>>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>>> at 
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
>>> at 
>>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at 
>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>> at 
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>> at 
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>> at 
>>> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>> at 
>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>> at 
>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>> at 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
>>> at 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
>>> at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>>> at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>>> at 
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>> at 
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:724)
>>> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>>> 

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
The exception looks like the one mentioned in
https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
Spark?

On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang 
wrote:

> Hi,
>
> My data looks like this:
>
> +---++--+
> | col_name  | data_type  | comment  |
> +---++--+
> | cust_id   | string |  |
> | part_num  | int|  |
> | ip_list   | array>   |  |
> | vid_list  | array>  |  |
> | fso_list  | array>  |  |
> | src   | string |  |
> | date  | int|  |
> +---++--+
>
> And I did select *, it reports ParquetDecodingException.
>
> Is this type not supported in SparkSQL?
>
> Detailed error message here:
>
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com): 
> parquet.io.ParquetDecodingException:
> Can not read value at 0 in block -1 in file hdfs://xxx/part-m-0.gz.parquet
> at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
> at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
> at java.util.ArrayList.elementData(ArrayList.java:400)
> at java.util.ArrayList.get(ArrayList.java:413)
> at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
> at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
> at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
> at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
> at 
> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:290)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
> at 
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
> at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
> at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
> at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>


Re: Convert DStream to DataFrame

2015-04-24 Thread Yin Huai
Hi Sergio,

I missed this thread somehow... For the error "case classes cannot have more
than 22 parameters.", it is a limitation of Scala (see
https://issues.scala-lang.org/browse/SI-7296). You can follow the
instructions at
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
to create a table with more than 22 columns. Basically, you first create an
RDD[Row] and a schema for the table represented by a StructType. Then, you
use createDataFrame to apply the schema.
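
A minimal sketch of that RDD[Row] + StructType route (it gets around the 22-field case
class limit). The field names and the value-extraction step are placeholders rather
than anything from this thread; a real job would first parse each JSON message into
the values it needs.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def toDataFrame(values: RDD[Seq[String]], sqlContext: SQLContext): DataFrame = {
  // More than 22 columns is not a problem for a programmatic schema.
  val fieldNames = (1 to 30).map(i => s"field$i")
  val schema = StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))
  val rows = values.map(seq => Row.fromSeq(seq))
  sqlContext.createDataFrame(rows, schema)
}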

Thanks,

Yin

On Fri, Apr 24, 2015 at 10:44 AM, Sergio Jiménez Barrio <
drarse.a...@gmail.com> wrote:

> Solved! I have solved the problem combining both solutions. The result is
> this:
>
> messages.foreachRDD { rdd =>
>   val message: RDD[String] = rdd.map { y => y._2 }
>   val sqlContext = 
> SQLContextSingleton.getInstance(rdd.sparkContext)
>   import sqlContext.implicits._
>   val df :DataFrame = sqlContext.jsonRDD(message).toDF()
>   df.groupBy("classification").count().show()
>   println("")
> }
>
>
>
>
>
> With the SQLContextSingleton the function of Spark Documentation
> Thanks for all!
>
>
>
> 2015-04-23 10:29 GMT+02:00 Sergio Jiménez Barrio :
>
>> Thank you ver much, Tathagata!
>>
>>
>> El miércoles, 22 de abril de 2015, Tathagata Das 
>> escribió:
>>
>>> Aaah, that. That is probably a limitation of the SQLContext (cc'ing Yin
>>> for more information).
>>>
>>>
>>> On Wed, Apr 22, 2015 at 7:07 AM, Sergio Jiménez Barrio <
>>> drarse.a...@gmail.com> wrote:
>>>
 Sorry, this is the error:

 [error] /home/sergio/Escritorio/hello/streaming.scala:77:
 Implementation restriction: case classes cannot have more than 22
 parameters.



 2015-04-22 16:06 GMT+02:00 Sergio Jiménez Barrio >>> >:

> I tried the solution of the guide, but I exceded the size of case
> class Row:
>
>
> 2015-04-22 15:22 GMT+02:00 Tathagata Das 
> :
>
>> Did you checkout the latest streaming programming guide?
>>
>>
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
>>
>> You also need to be aware of that to convert json RDDs to dataframe,
>> sqlContext has to make a pass on the data to learn the schema. This will
>> fail if a batch has no data. You have to safeguard against that.
>>
>> On Wed, Apr 22, 2015 at 6:19 AM, ayan guha 
>> wrote:
>>
>>> What about sqlcontext.createDataframe(rdd)?
>>> On 22 Apr 2015 23:04, "Sergio Jiménez Barrio" 
>>> wrote:
>>>
 Hi,

 I am using Kafka with Apache Stream to send JSON to Apache Spark:

 val messages = KafkaUtils.createDirectStream[String, String, 
 StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

 Now, I want parse the DStream created to DataFrame, but I don't
 know if Spark 1.3 have some easy way for this. ¿Any suggestion? I can 
 get
 the message with:

 val lines = messages.map(_._2)

 Thank u for all. Sergio J.



>>
>

>>>
>>
>> --
>> Atte. Sergio Jiménez
>>
>
>


Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Yin Huai
Hi Shuai,

You can use "as" to create a table alias. For example, df1.as("df1"). Then
you can use $"df1.col" to refer to it.
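
A short sketch in Scala syntax (the snippet in your message is Java, but the same as
call exists on DataFrame there); it assumes both frames share an "id" column, matching
the error message:

import sqlContext.implicits._

val joined = df1.as("d1").join(df2.as("d2"), $"d1.id" === $"d2.id")
// The formerly ambiguous key can now be referenced through either alias.
val result = joined.select($"d1.id")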

Thanks,

Yin

On Thu, Apr 23, 2015 at 11:14 AM, Shuai Zheng  wrote:

> Hi All,
>
>
>
> I use 1.3.1
>
>
>
> When I have two DF and join them on a same name key, after that, I can’t
> get the common key by name.
>
>
>
> Basically:
>
> select * from t1 inner join t2 on t1.col1 = t2.col1
>
>
>
> And I am using purely DataFrame, spark SqlContext not HiveContext
>
>
>
> DataFrame df3 = df1.join(df2, df1.col(col).equalTo(df2.col(col))).select(
> *col*);
>
>
>
> because df1 and df2 join on the same key col,
>
>
>
> Then I can't reference the key col. I understand I should use a full
> qualified name for that column (like in SQL, use t1.col), but I don’t know
> how should I address this in spark sql.
>
>
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> Reference 'id' is ambiguous, could be: id#8L, id#0L.;
>
>
>
> It looks that joined key can't be referenced by name or by df1.col name
> pattern.
>
> The https://issues.apache.org/jira/browse/SPARK-5278 refer to a hive
> case, so I am not sure whether it is the same issue, but I still have the
> issue in latest code.
>
>
>
> It looks like the result after join won't keep the parent DF information
> anywhere?
>
>
>
> I check the ticket: https://issues.apache.org/jira/browse/SPARK-6273
>
>
>
> But not sure whether  it is the same issue? Should I open a new ticket for
> this?
>
>
>
> Regards,
>
>
>
> Shuai
>
>
>


Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Yin Huai
Xudong and Rex,

Can you try 1.3.1? With PR 5339, after we get a Hive Parquet table from the
metastore and convert it to our native Parquet code path, we will cache the
converted relation. For now, the first access to that Hive Parquet table
reads all of the footers (when you first refer to that table in a query or
call sqlContext.table("hiveParquetTable")). All of your later accesses will
hit the metadata cache.

Thanks,

Yin

On Tue, Apr 21, 2015 at 1:13 AM, Rex Xiong  wrote:

> We have the similar issue with massive parquet files, Cheng Lian, could
> you have a look?
>
> 2015-04-08 15:47 GMT+08:00 Zheng, Xudong :
>
>> Hi Cheng,
>>
>> I tried both these patches, and seems still not resolve my issue. And I
>> found the most time is spend on this line in newParquet.scala:
>>
>> ParquetFileReader.readAllFootersInParallel(
>>   sparkContext.hadoopConfiguration, seqAsJavaList(leaves),
>> taskSideMetaData)
>>
>> Which need read all the files under the Parquet folder, while our Parquet
>> folder has a lot of Parquet files (near 2000), read one file need about 2
>> seconds, so it become very slow ... And the PR 5231 did not skip this steps
>> so it not resolve my issue.
>>
>> As our Parquet files are generated by a Spark job, so the number of
>> .parquet files is same with the number of tasks, that is why we have so
>> many files. But these files actually have the same schema. Is there any way
>> to merge these files into one, or avoid scan each of them?
>>
>> On Sat, Apr 4, 2015 at 9:47 PM, Cheng Lian  wrote:
>>
>>>  Hey Xudong,
>>>
>>> We had been digging this issue for a while, and believe PR 5339
>>>  and PR 5334
>>>  should fix this issue.
>>>
>>> There two problems:
>>>
>>> 1. Normally we cache Parquet table metadata for better performance, but
>>> when converting Hive metastore Hive tables, the cache is not used. Thus
>>> heavy operations like schema discovery is done every time a metastore
>>> Parquet table is converted.
>>> 2. With Parquet task side metadata reading (which is turned on by
>>> default), we can actually skip the row group information in the footer.
>>> However, we accidentally called a Parquet function which doesn't skip row
>>> group information.
>>>
>>> For your question about schema merging, Parquet allows different
>>> part-files have different but compatible schemas. For example,
>>> part-1.parquet has columns a and b, while part-2.parquet may has
>>> columns a and c. In some cases, the summary files (_metadata and
>>> _common_metadata) contains the merged schema (a, b, and c), but it's not
>>> guaranteed. For example, when the user defined metadata stored different
>>> part-files contain different values for the same key, Parquet simply gives
>>> up writing summary files. That's why all part-files must be touched to get
>>> a precise merged schema.
>>>
>>> However, in scenarios where a centralized arbitrative schema is
>>> available (e.g. Hive metastore schema, or the schema provided by user via
>>> data source DDL), we don't need to do schema merging on driver side, but
>>> defer it to executor side and each task only needs to reconcile those
>>> part-files it needs to touch. This is also what the Parquet developers did
>>> recently for parquet-hadoop
>>> .
>>>
>>> Cheng
>>>
>>>
>>> On 3/31/15 11:49 PM, Zheng, Xudong wrote:
>>>
>>> Thanks Cheng!
>>>
>>>  Set 'spark.sql.parquet.useDataSourceApi' to false resolves my issues,
>>> but the PR 5231 seems not. Not sure any other things I did wrong ...
>>>
>>>  BTW, actually, we are very interested in the schema merging feature in
>>> Spark 1.3, so both these two solution will disable this feature, right? It
>>> seems that Parquet metadata is store in a file named _metadata in the
>>> Parquet file folder (each folder is a partition as we use partition table),
>>> why we need scan all Parquet part files? Is there any other solutions could
>>> keep schema merging feature at the same time? We are really like this
>>> feature :)
>>>
>>> On Tue, Mar 31, 2015 at 3:19 PM, Cheng Lian 
>>> wrote:
>>>
  Hi Xudong,

 This is probably because of Parquet schema merging is turned on by
 default. This is generally useful for Parquet files with different but
 compatible schemas. But it needs to read metadata from all Parquet
 part-files. This can be problematic when reading Parquet files with lots of
 part-files, especially when the user doesn't need schema merging.

 This issue is tracked by SPARK-6575, and here is a PR for it:
 https://github.com/apache/spark/pull/5231. This PR adds a
 configuration to disable schema merging by default when doing Hive
 metastore Parquet table conversion.

 Another workaround is to fallback to the old Parquet code by setting
 spark.sql.parquet.useDataSourceApi to false.

 Che

Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar,

Can you try 1.3.1 (
https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it
still shows the error?

Thanks,

Yin

On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin  wrote:

> This is strange. cc the dev list since it might be a bug.
>
>
>
> On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores  wrote:
>
>> Never mind. I found the solution:
>>
>> val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
>> hiveLoadedDataFrame.schema)
>>
>> which translate to convert the data frame to rdd and back again to data
>> frame. Not the prettiest solution, but at least it solves my problems.
>>
>>
>> Thanks,
>> Cesar Flores
>>
>>
>>
>> On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores  wrote:
>>
>>>
>>> I have a data frame in which I load data from a hive table. And my issue
>>> is that the data frame is missing the columns that I need to query.
>>>
>>> For example:
>>>
>>> val newdataset = dataset.where(dataset("label") === 1)
>>>
>>> gives me an error like the following:
>>>
>>> ERROR yarn.ApplicationMaster: User class threw exception: resolved
>>> attributes label missing from label, user_id, ...(the rest of the fields of
>>> my table
>>> org.apache.spark.sql.AnalysisException: resolved attributes label
>>> missing from label, user_id, ... (the rest of the fields of my table)
>>>
>>> where we can see that the label field actually exist. I manage to solve
>>> this issue by updating my syntax to:
>>>
>>> val newdataset = dataset.where($"label" === 1)
>>>
>>> which works. However I can not make this trick in all my queries. For
>>> example, when I try to do a unionAll from two subsets of the same data
>>> frame the error I am getting is that all my fields are missing.
>>>
>>> Can someone tell me if I need to do some post processing after loading
>>> from hive in order to avoid this kind of errors?
>>>
>>>
>>> Thanks
>>> --
>>> Cesar Flores
>>>
>>
>>
>>
>> --
>> Cesar Flores
>>
>
>


Re: dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Yin Huai
Hi Neal,

spark.sql.shuffle.partitions is the property that controls the number of
tasks after a shuffle (to generate t2, there is a shuffle for the
aggregations specified by groupBy and agg). You can use
sqlContext.setConf("spark.sql.shuffle.partitions", "newNumber") or
sqlContext.sql("set spark.sql.shuffle.partitions=newNumber") to set its value.

For t2.repartition(10).collect, because your t2 has 200 partitions, the first
stage has 200 tasks (the second stage will have 10 tasks). So, if you have
something like val t3 = t2.repartition(10), t3 will have 10 partitions.
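
As a sketch against the names in your snippet, the setting takes effect if it is in
place before the aggregation runs:

import org.apache.spark.sql.functions._

hiveContext.setConf("spark.sql.shuffle.partitions", "10")

val t1 = hiveContext.sql("FROM SalesJan2009 select * ")
val t2 = t1.groupBy("country", "state", "city").agg(avg("price").as("aprive"))

t2.rdd.partitions.size   // 10 now, instead of the default 200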

Thanks,

Yin

On Thu, Apr 16, 2015 at 3:04 PM, Neal Yin  wrote:

>  I have some trouble to control number of spark tasks for a stage.  This
> on latest spark 1.3.x source code build.
>
>  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sc.getConf.get("spark.default.parallelism”)  —> setup to *10*
> val t1 = hiveContext.sql("FROM SalesJan2009 select * ")
> val t2 = t1.groupBy("country", "state",
> "city").agg(avg("price").as("aprive”))
>
>  t1.rdd.partitions.size   —>  got 2
> t2.rdd.partitions.size  —>  got 200
>
>  First questions, why does t2’s partition size becomes 200?
>
>  Second questions, even if I do t2.repartition(10).collect,  in some
> stages, it still fires 200 tasks.
>
>  Thanks,
>
>  -Neal
>
>
>
>


Re: Spark SQL query key/value in Map

2015-04-16 Thread Yin Huai
For a map type column, fields['driver'] is the syntax to retrieve the map
value (in the schema, you can see "fields: map<string,string>"). The
fields.driver syntax is only used for struct types.
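
For example, your query rewritten with bracket syntax (table and column names from
your message):

val drivers = sqlContext.sql(
  "SELECT fields FROM raw_device_data WHERE fields['driver'] = 'driver1'")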

On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco 
wrote:

> Hi,
>
> I'm new with both Cassandra and Spark and am experimenting with what Spark
> SQL can do as it will affect my Cassandra data model.
>
> What I need is a model that can accept arbitrary fields, similar to
> Postgres's Hstore. Right now, I'm trying out the map type in Cassandra but
> I'm getting the exception below when running my Spark SQL:
>
> java.lang.RuntimeException: Can't access nested field in type
> MapType(StringType,StringType,true)
>
> The schema I have now is:
> root
>  |-- device_id: integer (nullable = true)
>  |-- event_date: string (nullable = true)
>  |-- fields: map (nullable = true)
>  ||-- key: string
>  ||-- value: string (valueContainsNull = true)
>
> And my Spark SQL is:
> SELECT fields from raw_device_data where fields.driver = 'driver1'
>
> From what I gather, this should work for a JSON based RDD
> (
> https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
> ).
>
> Is this not supported for a Cassandra map type?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-query-key-value-in-Map-tp22517.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Oh, just noticed that I missed "attach"... Yeah, your scripts will be
helpful. Thanks!

On Thu, Apr 16, 2015 at 12:03 PM, Arush Kharbanda <
ar...@sigmoidanalytics.com> wrote:

> Yes, i am able to reproduce the problem. Do you need the scripts to create
> the tables?
>
> On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai  wrote:
>
>> Can your code that can reproduce the problem?
>>
>> On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>>> Hi
>>>
>>> As per JIRA this issue is resolved, but i am still facing this issue.
>>>
>>> SPARK-2734 - DROP TABLE should also uncache table
>>>
>>>
>>> --
>>>
>>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>>
>>> *Arush Kharbanda* || Technical Teamlead
>>>
>>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>>
>>
>>
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>


Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Can your code that can reproduce the problem?

On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda  wrote:

> Hi
>
> As per JIRA this issue is resolved, but i am still facing this issue.
>
> SPARK-2734 - DROP TABLE should also uncache table
>
>
> --
>
> [image: Sigmoid Analytics] 
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>


Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
I think the slowness is caused by the way that we serialize/deserialize the
value of a complex type. I have opened
https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement.

On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip  wrote:

> The schema has a StructType.
>
> Justin
>
> On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai  wrote:
>
>> Hi Justin,
>>
>> Does the schema of your data have any decimal, array, map, or struct type?
>>
>> Thanks,
>>
>> Yin
>>
>> On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip 
>> wrote:
>>
>>> Hello,
>>>
>>> I have a parquet file of around 55M rows (~ 1G on disk). Performing
>>> simple grouping operation is pretty efficient (I get results within 10
>>> seconds). However, after called DataFrame.cache, I observe a significant
>>> performance degrade, the same operation now takes 3+ minutes.
>>>
>>> My hunch is that DataFrame cannot leverage its columnar format after
>>> persisting in memory. But cannot find anywhere from the doc mentioning this.
>>>
>>> Did I miss anything?
>>>
>>> Thanks!
>>>
>>> Justin
>>>
>>
>>
>


Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
Hi Justin,

Does the schema of your data have any decimal, array, map, or struct type?

Thanks,

Yin

On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip  wrote:

> Hello,
>
> I have a parquet file of around 55M rows (~ 1G on disk). Performing simple
> grouping operation is pretty efficient (I get results within 10 seconds).
> However, after called DataFrame.cache, I observe a significant performance
> degrade, the same operation now takes 3+ minutes.
>
> My hunch is that DataFrame cannot leverage its columnar format after
> persisting in memory. But cannot find anywhere from the doc mentioning this.
>
> Did I miss anything?
>
> Thanks!
>
> Justin
>


Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Yin Huai
For cast, you can use the selectExpr method. For example,
df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2").
Or you can use df.select(df("colA").cast("int"), ...).
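
Applied to the column from your snippet (this sketch assumes every column came in as a
string from the csv package; note that selectExpr needs to list every column you want
to keep):

val typed  = df.selectExpr("cast(customer_id as int) as customer_id")
val typed2 = df.select(df("customer_id").cast("int").as("customer_id"))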

On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust 
wrote:

> val df = Seq(("test", 1)).toDF("col1", "col2")
>
> You can use SQL style expressions as a string:
>
> df.filter("col1 IS NOT NULL").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([test,1])
>
> Or you can also reference columns using df("colName") or $"colName" or
> col("colName")
>
> df.filter(df("col1") === "test").collect()
> res2: Array[org.apache.spark.sql.Row] = Array([test,1])
>
> On Thu, Apr 2, 2015 at 7:45 PM, Yana Kadiyska 
> wrote:
>
>> Hi folks, having some seemingly noob issues with the dataframe API.
>>
>> I have a DF which came from the csv package.
>>
>> 1. What would be an easy way to cast a column to a given type -- my DF
>> columns are all typed as strings coming from a csv. I see a schema getter
>> but not setter on DF
>>
>> 2. I am trying to use the syntax used in various blog posts but can't
>> figure out how to reference a column by name:
>>
>> scala> df.filter("customer_id"!="")
>> :23: error: overloaded method value filter with alternatives:
>>   (conditionExpr: String)org.apache.spark.sql.DataFrame 
>>   (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
>>  cannot be applied to (Boolean)
>>   df.filter("customer_id"!="")
>>
>> ​
>> 3. what would be the recommended way to drop a row containing a null
>> value -- is it possible to do this:
>> scala> df.filter("customer_id" IS NOT NULL)
>>
>>
>>
>>
>


Re: rdd.toDF().saveAsParquetFile("tachyon://host:19998/test")

2015-03-27 Thread Yin Huai
You are hitting https://issues.apache.org/jira/browse/SPARK-6330. It has
been fixed in 1.3.1, which will be released soon.

On Fri, Mar 27, 2015 at 10:42 PM, sud_self <852677...@qq.com> wrote:

> spark version is 1.3.0 with tanhyon-0.6.1
>
> QUESTION DESCRIPTION: rdd.saveAsObjectFile("tachyon://host:19998/test")
> and
> rdd.saveAsTextFile("tachyon://host:19998/test")  succeed,   but
> rdd.toDF().saveAsParquetFile("tachyon://host:19998/test") failure.
>
> ERROR MESSAGE:java.lang.IllegalArgumentException: Wrong FS:
> tachyon://host:19998/test, expected: hdfs://host:8020
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at
> org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:465)
> at
>
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
> at
>
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/rdd-toDF-saveAsParquetFile-tachyon-host-19998-test-tp22264.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Date and decimal datatype not working

2015-03-23 Thread Yin Huai
To store to a CSV file, you can use the Spark-CSV library.

On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA  wrote:

>  Thanks. This worked well as per your suggestions. I had to run following:
>
> val TABLE_A =
> sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
> => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)),
> BigDecimal(p(5)), BigDecimal(p(6
>
>
>
> Now I am stuck at another step. I have run a SQL query, where I am
> Selecting from all the fields with some where clause , TSTAMP filtered with
> date range and order by TSTAMP clause. That is running fine.
>
>
>
> Then I am trying to store the output in a CSV file. I am using
> saveAsTextFile(“filename”) function. But it is giving error. Can you please
> help me to write a proper syntax to store output in a CSV file?
>
>
>
>
>
> Thanks & Regards
>
> ---
>
> Ananda Basak
>
> Ph: 425-213-7092
>
>
>
> *From:* BASAK, ANANDA
> *Sent:* Tuesday, March 17, 2015 3:08 PM
> *To:* Yin Huai
> *Cc:* user@spark.apache.org
> *Subject:* RE: Date and decimal datatype not working
>
>
>
> Ok, thanks for the suggestions. Let me try and will confirm all.
>
>
>
> Regards
>
> Ananda
>
>
>
> *From:* Yin Huai [mailto:yh...@databricks.com]
> *Sent:* Tuesday, March 17, 2015 3:04 PM
> *To:* BASAK, ANANDA
> *Cc:* user@spark.apache.org
> *Subject:* Re: Date and decimal datatype not working
>
>
>
> p(0) is a String. So, you need to explicitly convert it to a Long. e.g.
> p(0).trim.toLong. You also need to do it for p(2). For those BigDecimals
> value, you need to create BigDecimal objects from your String values.
>
>
>
> On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA  wrote:
>
>   Hi All,
>
> I am very new in Spark world. Just started some test coding from last
> week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding.
>
> I am having issues while using Date and decimal data types. Following is
> my code that I am simply running on scala prompt. I am trying to define a
> table and point that to my flat file containing raw data (pipe delimited
> format). Once that is done, I will run some SQL queries and put the output
> data in to another flat file with pipe delimited format.
>
>
>
> ***
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> import sqlContext.createSchemaRDD
>
>
>
>
>
> // Define row and table
>
> case class ROW_A(
>
>   TSTAMP:   Long,
>
>   USIDAN: String,
>
>   SECNT:Int,
>
>   SECT:   String,
>
>   BLOCK_NUM:BigDecimal,
>
>   BLOCK_DEN:BigDecimal,
>
>   BLOCK_PCT:BigDecimal)
>
>
>
> val TABLE_A =
> sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
> => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
>
>
>
> TABLE_A.registerTempTable("TABLE_A")
>
>
>
> ***
>
>
>
> The second last command is giving error, like following:
>
> :17: error: type mismatch;
>
> found   : String
>
> required: Long
>
>
>
> Looks like the content from my flat file are considered as String always
> and not as Date or decimal. How can I make Spark to take them as Date or
> decimal types?
>
>
>
> Regards
>
> Ananda
>
>
>


Re: Use pig load function in spark

2015-03-23 Thread Yin Huai
Hello Kevin,

You can take a look at our generic load function.

For example, you can use

val df = sqlContext.load("/myData", "parquet")

to load a parquet dataset stored in "/myData" as a DataFrame.

You can use it to load data stored in various formats, like json (Spark
built-in), parquet (Spark built-in), avro, and csv.
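
The same entry point also takes an options map; for example, with the spark-csv
package (the source name and option keys below come from that package's documented
interface, so treat them as an assumption, and the path is a placeholder):

val csvDf = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "/myData/data.csv", "header" -> "true"))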

Thanks,

Yin

On Mon, Mar 23, 2015 at 7:14 PM, Dai, Kevin  wrote:

>  Hi, Paul
>
>
>
> You are right.
>
>
>
> The story is that we have a lot of pig load function to load our different
> data.
>
>
>
> And now we want to use spark to read and process these data.
>
>
>
> So we want to figure out a way to reuse our existing load function in
> spark to read these data.
>
>
>
> Any idea?
>
>
>
> Best Regards,
>
> Kevin.
>
>
>
> *From:* Paul Brown [mailto:p...@mult.ifario.us]
> *Sent:* March 24, 2015 4:11
> *To:* Dai, Kevin
> *Subject:* Re: Use pig load function in spark
>
>
>
>
>
> The answer is "Maybe, but you probably don't want to do that.".
>
>
>
> A typical Pig load function is devoted to bridging external data into
> Pig's type system, but you don't really need to do that in Spark because it
> is (thankfully) not encumbered by Pig's type system.  What you probably
> want to do is to figure out a way to use native Spark facilities (e.g.,
> textFile) coupled with some of the logic out of your Pig load function
> necessary to turn your external data into an RDD.
>
>
>
>
>   —
> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>
>
>
> On Mon, Mar 23, 2015 at 2:29 AM, Dai, Kevin  wrote:
>
> Hi, all
>
>
>
> Can spark use pig’s load function to load data?
>
>
>
> Best Regards,
>
> Kevin.
>
>
>


Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yin Huai
spark.sql.shuffle.partitions only controls the number of tasks in the second
stage (the number of reducers). For your case, I'd say that the number of
tasks in the first stage (the number of mappers) will be the number of files
you have.

Actually, have you changed "spark.executor.memory" (it controls the memory
for an executor of your application)? I did not see it in your original
email. The difference between worker memory and executor memory is explained
at http://spark.apache.org/docs/1.3.0/spark-standalone.html:

SPARK_WORKER_MEMORY
Total amount of memory to allow Spark applications to use on the machine,
e.g. 1000m, 2g (default: total memory minus 1 GB); note that each
application's individual memory is configured using its
spark.executor.memory property.
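
For completeness, a sketch of setting it programmatically (spark-defaults.conf or
spark-submit --conf are the usual routes; the value must be in place before the
SparkContext is created, and 8g simply mirrors the 8GB workers mentioned in this
thread):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parquet-aggregation")   // name is just a placeholder
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)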


On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas 
wrote:

> Actually I realized that the correct way is:
>
> sqlContext.sql("set spark.sql.shuffle.partitions=1000")
>
> but I am still experiencing the same behavior/error.
>
> On 20 March 2015 at 16:04, Yiannis Gkoufas  wrote:
>
>> Hi Yin,
>>
>> the way I set the configuration is:
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> sqlContext.setConf("spark.sql.shuffle.partitions","1000");
>>
>> it is the correct way right?
>> In the mapPartitions task (the first task which is launched), I get again
>> the same number of tasks and again the same error. :(
>>
>> Thanks a lot!
>>
>> On 19 March 2015 at 17:40, Yiannis Gkoufas  wrote:
>>
>>> Hi Yin,
>>>
>>> thanks a lot for that! Will give it a shot and let you know.
>>>
>>> On 19 March 2015 at 16:30, Yin Huai  wrote:
>>>
>>>> Was the OOM thrown during the execution of first stage (map) or the
>>>> second stage (reduce)? If it was the second stage, can you increase the
>>>> value of spark.sql.shuffle.partitions and see if the OOM disappears?
>>>>
>>>> This setting controls the number of reduces Spark SQL will use and the
>>>> default is 200. Maybe there are too many distinct values and the memory
>>>> pressure on every task (of those 200 reducers) is pretty high. You can
>>>> start with 400 and increase it until the OOM disappears. Hopefully this
>>>> will help.
>>>>
>>>> Thanks,
>>>>
>>>> Yin
>>>>
>>>>
>>>> On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas 
>>>> wrote:
>>>>
>>>>> Hi Yin,
>>>>>
>>>>> Thanks for your feedback. I have 1700 parquet files, sized 100MB each.
>>>>> The number of tasks launched is equal to the number of parquet files. Do
>>>>> you have any idea on how to deal with this situation?
>>>>>
>>>>> Thanks a lot
>>>>> On 18 Mar 2015 17:35, "Yin Huai"  wrote:
>>>>>
>>>>>> Seems there are too many distinct groups processed in a task, which
>>>>>> trigger the problem.
>>>>>>
>>>>>> How many files do your dataset have and how large is a file? Seems
>>>>>> your query will be executed with two stages, table scan and map-side
>>>>>> aggregation in the first stage and the final round of reduce-side
>>>>>> aggregation in the second stage. Can you take a look at the numbers of
>>>>>> tasks launched in these two stages?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Yin
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas <
>>>>>> johngou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi there, I set the executor memory to 8g but it didn't help
>>>>>>>
>>>>>>> On 18 March 2015 at 13:59, Cheng Lian  wrote:
>>>>>>>
>>>>>>>> You should probably increase executor memory by setting
>>>>>>>> "spark.executor.memory".
>>>>>>>>
>>>>>>>> Full list of available configurations can be found here
>>>>>>>> http://spark.apache.org/docs/latest/configuration.html
>>>>>>>>
>>>>>>>> Cheng
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I was trying the new DataFrame API with some basic operations on a
>>>>>>>>> parquet dataset.
>>>>>>>>> I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in
>>>>>>>>> a standalone cluster mode.
>>>>>>>>> The code is the following:
>>>>>>>>>
>>>>>>>>> val people = sqlContext.parquetFile("/data.parquet");
>>>>>>>>> val res = people.groupBy("name","date").
>>>>>>>>> agg(sum("power"),sum("supply")).take(10);
>>>>>>>>> System.out.println(res);
>>>>>>>>>
>>>>>>>>> The dataset consists of 16 billion entries.
>>>>>>>>> The error I get is java.lang.OutOfMemoryError: GC overhead limit
>>>>>>>>> exceeded
>>>>>>>>>
>>>>>>>>> My configuration is:
>>>>>>>>>
>>>>>>>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>>>>>>>> spark.driver.memory6g
>>>>>>>>> spark.executor.extraJavaOptions -XX:+UseCompressedOops
>>>>>>>>> spark.shuffle.managersort
>>>>>>>>>
>>>>>>>>> Any idea how can I workaround this?
>>>>>>>>>
>>>>>>>>> Thanks a lot
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>


Re: Spark SQL filter DataFrame by date?

2015-03-19 Thread Yin Huai
Can you add your code snippet? Seems it's missing in the original email.

Thanks,

Yin

On Thu, Mar 19, 2015 at 3:22 PM, kamatsuoka  wrote:

> I'm trying to filter a DataFrame by a date column, with no luck so far.
> Here's what I'm doing:
>
>
>
> When I run reqs_day.count() I get zero, apparently because my date
> parameter
> gets translated to 16509.
>
> Is this a bug, or am I doing it wrong?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-filter-DataFrame-by-date-tp22149.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yin Huai
Was the OOM thrown during the execution of the first stage (map) or the
second stage (reduce)? If it was the second stage, can you increase the value
of spark.sql.shuffle.partitions and see if the OOM disappears?

This setting controls the number of reducers Spark SQL will use, and the
default is 200. Maybe there are too many distinct values and the memory
pressure on every task (of those 200 reducers) is pretty high. You can start
with 400 and increase it until the OOM disappears. Hopefully this will help.

Thanks,

Yin


On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas 
wrote:

> Hi Yin,
>
> Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The
> number of tasks launched is equal to the number of parquet files. Do you
> have any idea on how to deal with this situation?
>
> Thanks a lot
> On 18 Mar 2015 17:35, "Yin Huai"  wrote:
>
>> Seems there are too many distinct groups processed in a task, which
>> trigger the problem.
>>
>> How many files do your dataset have and how large is a file? Seems your
>> query will be executed with two stages, table scan and map-side aggregation
>> in the first stage and the final round of reduce-side aggregation in the
>> second stage. Can you take a look at the numbers of tasks launched in these
>> two stages?
>>
>> Thanks,
>>
>> Yin
>>
>> On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas 
>> wrote:
>>
>>> Hi there, I set the executor memory to 8g but it didn't help
>>>
>>> On 18 March 2015 at 13:59, Cheng Lian  wrote:
>>>
>>>> You should probably increase executor memory by setting
>>>> "spark.executor.memory".
>>>>
>>>> Full list of available configurations can be found here
>>>> http://spark.apache.org/docs/latest/configuration.html
>>>>
>>>> Cheng
>>>>
>>>>
>>>> On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I was trying the new DataFrame API with some basic operations on a
>>>>> parquet dataset.
>>>>> I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a
>>>>> standalone cluster mode.
>>>>> The code is the following:
>>>>>
>>>>> val people = sqlContext.parquetFile("/data.parquet");
>>>>> val res = people.groupBy("name","date").agg(sum("power"),sum("supply")
>>>>> ).take(10);
>>>>> System.out.println(res);
>>>>>
>>>>> The dataset consists of 16 billion entries.
>>>>> The error I get is java.lang.OutOfMemoryError: GC overhead limit
>>>>> exceeded
>>>>>
>>>>> My configuration is:
>>>>>
>>>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>>>> spark.driver.memory6g
>>>>> spark.executor.extraJavaOptions -XX:+UseCompressedOops
>>>>> spark.shuffle.managersort
>>>>>
>>>>> Any idea how can I workaround this?
>>>>>
>>>>> Thanks a lot
>>>>>
>>>>
>>>>
>>>
>>


Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
I meant that table properties and serde properties are used to store the
metadata of a Spark SQL data source table. We do not set other fields like
the SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a
data source table should not show unrelated stuff like the SerDe lib and
InputFormat. I have created https://issues.apache.org/jira/browse/SPARK-6413
to track improving the output of the DESCRIBE statement.

On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai  wrote:

> Hi Christian,
>
> Your table is stored correctly in Parquet format.
>
> For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
> data source table (
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
> We are only using Hive's metastore to store the metadata (to be specific,
> only table properties and serde properties). When you look at table
> property, there will be a field called "spark.sql.sources.provider" and the
> value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
> look at your files in the file system. They are stored by Parquet.
>
> Thanks,
>
> Yin
>
> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez 
> wrote:
>
>> Hi all,
>>
>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
>> schema _and_ storage format in the Hive metastore, so that the table
>> cannot be read from inside Hive. Spark itself can read the table, but
>> Hive throws a Serialization error because it doesn't know it is
>> Parquet.
>>
>> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
>> df.saveAsTable("spark_test_foo")
>>
>> Expected:
>>
>> COLUMNS(
>>   education BIGINT,
>>   income BIGINT
>> )
>>
>> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>
>> Actual:
>>
>> COLUMNS(
>>   col array COMMENT "from deserializer"
>> )
>>
>> SerDe Library: org.apache.hadoop.hive.serd2.MetadataTypedColumnsetSerDe
>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>
>> ---
>>
>> Manually changing schema and storage restores access in Hive and
>> doesn't affect Spark. Note also that Hive's table property
>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>> the schema data is serialized when sent to Hive but not deserialized
>> properly on receive.
>>
>> I'm tracing execution through source code... but before I get any
>> deeper, can anyone reproduce this behavior?
>>
>> Cheers,
>>
>> Christian
>>
>> --
>> Christian Perez
>> Silicon Valley Data Science
>> Data Analyst
>> christ...@svds.com
>> @cp_phd
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Hi Christian,

Your table is stored correctly in Parquet format.

For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
data source table (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
We are only using Hive's metastore to store the metadata (to be specific,
only table properties and serde properties). When you look at table
property, there will be a field called "spark.sql.sources.provider" and the
value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
look at your files in the file system. They are stored by Parquet.

Thanks,

Yin

On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez 
wrote:

> Hi all,
>
> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
> schema _and_ storage format in the Hive metastore, so that the table
> cannot be read from inside Hive. Spark itself can read the table, but
> Hive throws a Serialization error because it doesn't know it is
> Parquet.
>
> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
> df.saveAsTable("spark_test_foo")
>
> Expected:
>
> COLUMNS(
>   education BIGINT,
>   income BIGINT
> )
>
> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>
> Actual:
>
> COLUMNS(
>   col array COMMENT "from deserializer"
> )
>
> SerDe Library: org.apache.hadoop.hive.serd2.MetadataTypedColumnsetSerDe
> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>
> ---
>
> Manually changing schema and storage restores access in Hive and
> doesn't affect Spark. Note also that Hive's table property
> "spark.sql.sources.schema" is correct. At first glance, it looks like
> the schema data is serialized when sent to Hive but not deserialized
> properly on receive.
>
> I'm tracing execution through source code... but before I get any
> deeper, can anyone reproduce this behavior?
>
> Cheers,
>
> Christian
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yin Huai
Seems there are too many distinct groups processed in a task, which triggers
the problem.

How many files do your dataset have and how large is a file? Seems your
query will be executed with two stages, table scan and map-side aggregation
in the first stage and the final round of reduce-side aggregation in the
second stage. Can you take a look at the numbers of tasks launched in these
two stages?

Thanks,

Yin

On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas 
wrote:

> Hi there, I set the executor memory to 8g but it didn't help
>
> On 18 March 2015 at 13:59, Cheng Lian  wrote:
>
>> You should probably increase executor memory by setting
>> "spark.executor.memory".
>>
>> Full list of available configurations can be found here
>> http://spark.apache.org/docs/latest/configuration.html
>>
>> Cheng
>>
>>
>> On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:
>>
>>> Hi there,
>>>
>>> I was trying the new DataFrame API with some basic operations on a
>>> parquet dataset.
>>> I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a
>>> standalone cluster mode.
>>> The code is the following:
>>>
>>> val people = sqlContext.parquetFile("/data.parquet");
>>> val res = people.groupBy("name","date").agg(sum("power"),sum("supply")
>>> ).take(10);
>>> System.out.println(res);
>>>
>>> The dataset consists of 16 billion entries.
>>> The error I get is java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> My configuration is:
>>>
>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>> spark.driver.memory 6g
>>> spark.executor.extraJavaOptions -XX:+UseCompressedOops
>>> spark.shuffle.manager sort
>>>
>>> Any idea how can I workaround this?
>>>
>>> Thanks a lot
>>>
>>
>>
>


Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Yin Huai
Hi Roberto,

For now, if the "timestamp" is a top-level column (not a field in a
struct), you can use backticks to quote the column name, like `timestamp`.

Thanks,

Yin

On Wed, Mar 18, 2015 at 12:10 PM, Roberto Coluccio <
roberto.coluc...@gmail.com> wrote:

> Hey Cheng, thank you so much for your suggestion, the problem was actually
> a column/field called "timestamp" in one of the case classes!! Once I
> changed its name everything worked out fine again. Let me say it was kinda
> frustrating ...
>
> Roberto
>
> On Wed, Mar 18, 2015 at 4:07 PM, Roberto Coluccio <
> roberto.coluc...@gmail.com> wrote:
>
>> You know, I actually have one of the columns called "timestamp" ! This
>> may really cause the problem reported in the bug you linked, I guess.
>>
>> On Wed, Mar 18, 2015 at 3:37 PM, Cheng Lian 
>> wrote:
>>
>>>  I suspect that you hit this bug
>>> https://issues.apache.org/jira/browse/SPARK-6250, it depends on the
>>> actual contents of your query.
>>>
>>> Yin had opened a PR for this, although not merged yet, it should be a
>>> valid fix https://github.com/apache/spark/pull/5078
>>>
>>> This fix will be included in 1.3.1.
>>>
>>> Cheng
>>>
>>> On 3/18/15 10:04 PM, Roberto Coluccio wrote:
>>>
>>> Hi Cheng, thanks for your reply.
>>>
>>>  The query is something like:
>>>
>>>  SELECT * FROM (
   SELECT m.column1, IF (d.columnA IS NOT null, d.columnA, m.column2),
 ..., m.columnN FROM tableD d RIGHT OUTER JOIN tableM m on m.column2 =
 d.columnA WHERE m.column2!=\"None\" AND d.columnA!=\"\"
   UNION ALL
   SELECT ... [another SELECT statement with different conditions but
 same tables]
   UNION ALL
   SELECT ... [another SELECT statement with different conditions but
 same tables]
 ) a
>>>
>>>
>>>  I'm using just sqlContext, no hiveContext. Please, note once again
>>> that this perfectly worked w/ Spark 1.1.x.
>>>
>>>  The tables, i.e. tableD and tableM are previously registered with the
>>> RDD.registerTempTable method, where the input RDDs are actually a
>>> RDD[MyCaseClassM/D], with MyCaseClassM and MyCaseClassD being simple
>>> case classes with only (and less than 22) String fields.
>>>
>>>  Hope the situation is a bit more clear. Thanks anyone who will help me
>>> out here.
>>>
>>>  Roberto
>>>
>>>
>>>
>>> On Wed, Mar 18, 2015 at 12:09 PM, Cheng Lian 
>>> wrote:
>>>
  Would you mind to provide the query? If it's confidential, could you
 please help constructing a query that reproduces this issue?

 Cheng

 On 3/18/15 6:03 PM, Roberto Coluccio wrote:

 Hi everybody,

  When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both
 1.2.0 and 1.2.1) I encounter a weird error never occurred before about
 which I'd kindly ask for any possible help.

   In particular, all my Spark SQL queries fail with the following
 exception:

  java.lang.RuntimeException: [1.218] failure: identifier expected
>
> [my query listed]
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
>   at
> org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
>   at
> org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
>   at
> org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
>   at
> org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
>   at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   ...



  The unit tests I've got for testing this stuff fail both if I
 build+test the project with Maven and if I run then as single ScalaTest
 files or test suites/packages.

  When running my app as usual on EMR in YARN-cluster mode, I get the
 following:

  15/03/17 11:32:14 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: [1.218] failure: 
 identifier expected

 SELECT * FROM ... (my query)


^)
 Exception in thread "Driver" java.lang.RuntimeException: [1.218] failure: 
 identifier expected

 SELECT * FROM ... (my query)   
   

Re: Date and decimal datatype not working

2015-03-17 Thread Yin Huai
p(0) is a String, so you need to explicitly convert it to a Long, e.g.
p(0).trim.toLong. You also need to do it for p(2). For the BigDecimal
values, you need to create BigDecimal objects from your String values.
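
A hedged sketch of those conversions against the ROW_A case class quoted
below. Note that String.split treats its argument as a regex, so the pipe is
passed here as a character literal; that adjustment is not discussed in this
thread:

// Convert each String field to the type declared in ROW_A.
val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt")
  .map(_.split('|'))
  .map(p => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3),
                  BigDecimal(p(4).trim), BigDecimal(p(5).trim), BigDecimal(p(6).trim)))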

On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA  wrote:

>  Hi All,
>
> I am very new in Spark world. Just started some test coding from last
> week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding.
>
> I am having issues while using Date and decimal data types. Following is
> my code that I am simply running on scala prompt. I am trying to define a
> table and point that to my flat file containing raw data (pipe delimited
> format). Once that is done, I will run some SQL queries and put the output
> data in to another flat file with pipe delimited format.
>
>
>
> ***
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> import sqlContext.createSchemaRDD
>
>
>
>
>
> // Define row and table
>
> case class ROW_A(
>
>   TSTAMP:   Long,
>
>   USIDAN: String,
>
>   SECNT:Int,
>
>   SECT:   String,
>
>   BLOCK_NUM:BigDecimal,
>
>   BLOCK_DEN:BigDecimal,
>
>   BLOCK_PCT:BigDecimal)
>
>
>
> val TABLE_A =
> sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
> => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
>
>
>
> TABLE_A.registerTempTable("TABLE_A")
>
>
>
> ***
>
>
>
> The second last command is giving error, like following:
>
> :17: error: type mismatch;
>
> found   : String
>
> required: Long
>
>
>
> Looks like the content from my flat file are considered as String always
> and not as Date or decimal. How can I make Spark to take them as Date or
> decimal types?
>
>
>
> Regards
>
> Ananda
>


Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
Initially, an attribute reference (column reference), like selecting a
column from a table, is not resolved since we do not know if the reference
is valid or not (if this column exists in the underlying table). In the
query compilation process, we will first analyze this query and resolve
those attribute references. A resolved attribute reference means that this
reference is valid and we know where to get the column values from the
input. Hope this is helpful.

On Tue, Mar 17, 2015 at 2:19 PM, Ophir Cohen  wrote:

> Thanks you for the answer and one more question: what does it mean
> 'resolved attribute'?
> On Mar 17, 2015 8:14 PM, "Yin Huai"  wrote:
>
>> The number is an id we use internally to identify a resolved Attribute.
>> Looks like basic_null_diluted_d was not resolved since there is no id
>> associated with it.
>>
>> On Tue, Mar 17, 2015 at 2:08 PM, Ophir Cohen  wrote:
>>
>>> Interesting, I thought the problem is with the method itself.
>>> I will check it soon and update.
>>> Can you elaborate what does it mean the # and the number? Is that a
>>> reference to the field in the rdd?
>>> Thank you,
>>> Ophir
>>> On Mar 17, 2015 7:06 PM, "Yin Huai"  wrote:
>>>
>>>> Seems "basic_null_diluted_d" was not resolved? Can you check if
>>>> basic_null_diluted_d is in your table?
>>>>
>>>> On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen  wrote:
>>>>
>>>>> Hi Guys,
>>>>> I'm registering a function using:
>>>>> sqlc.registerFunction("makeEstEntry",ReutersDataFunctions.makeEstEntry
>>>>> _)
>>>>>
>>>>> Then I register the table and try to query the table using that
>>>>> function and I get:
>>>>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>>>>> Unresolved attributes:
>>>>> 'makeEstEntry(numest#20,median#21,mean#22,stddev#23,high#24,low#25,currency_#26,units#27,'basic_null_diluted_d)
>>>>> AS FY0#2837, tree:
>>>>>
>>>>> Thanks!
>>>>> Ophir
>>>>>
>>>>
>>>>
>>


Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
The number is an id we use internally to identify a resolved Attribute.
Looks like basic_null_diluted_d was not resolved since there is no id
associated with it.

On Tue, Mar 17, 2015 at 2:08 PM, Ophir Cohen  wrote:

> Interesting, I thought the problem is with the method itself.
> I will check it soon and update.
> Can you elaborate what does it mean the # and the number? Is that a
> reference to the field in the rdd?
> Thank you,
> Ophir
> On Mar 17, 2015 7:06 PM, "Yin Huai"  wrote:
>
>> Seems "basic_null_diluted_d" was not resolved? Can you check if
>> basic_null_diluted_d is in your table?
>>
>> On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen  wrote:
>>
>>> Hi Guys,
>>> I'm registering a function using:
>>> sqlc.registerFunction("makeEstEntry",ReutersDataFunctions.makeEstEntry _)
>>>
>>> Then I register the table and try to query the table using that function
>>> and I get:
>>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>>> Unresolved attributes:
>>> 'makeEstEntry(numest#20,median#21,mean#22,stddev#23,high#24,low#25,currency_#26,units#27,'basic_null_diluted_d)
>>> AS FY0#2837, tree:
>>>
>>> Thanks!
>>> Ophir
>>>
>>
>>


Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
Seems "basic_null_diluted_d" was not resolved? Can you check if
basic_null_diluted_d is in your table?
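
A small sketch of one way to check, with myRdd standing in for whatever
SchemaRDD the table was registered from:

// The column must appear in this output for the UDF call to resolve.
myRdd.printSchema()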

On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen  wrote:

> Hi Guys,
> I'm registering a function using:
> sqlc.registerFunction("makeEstEntry",ReutersDataFunctions.makeEstEntry _)
>
> Then I register the table and try to query the table using that function
> and I get:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes:
> 'makeEstEntry(numest#20,median#21,mean#22,stddev#23,high#24,low#25,currency_#26,units#27,'basic_null_diluted_d)
> AS FY0#2837, tree:
>
> Thanks!
> Ophir
>


Re: Loading in json with spark sql

2015-03-13 Thread Yin Huai
Seems you want to use an array for the "providers" field, like
"providers":[{"id": ...}, {"id":...}] instead of
"providers":{{"id": ...}, {"id":...}}
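
For illustration, here is the first record from the message below rewritten
with an array; the closing brackets are reconstructed, since the quoted lines
are truncated:

{"user_id":"7070","providers":[{"id":"8753","name":"pjfig","behaviors":{"b1":"erwxt","b2":"yjooj"}},{"id":"8329","name":"dfvhh","behaviors":{"b1":"pjjdn","b2":"ooqsh"}}]}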

On Fri, Mar 13, 2015 at 7:45 PM, kpeng1  wrote:

> Hi All,
>
> I was noodling around with loading in a json file into spark sql's hive
> context and I noticed that I get the following message after loading in the
> json file:
> PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47
>
> I am using the HiveContext to load in the json file using the jsonFile
> command.  I also have 1 json object per line on the file.  Here is a sample
> of the contents in the json file:
>
> {"user_id":"7070","providers":{{"id":"8753","name":"pjfig","behaviors":{"b1":"erwxt","b2":"yjooj"}},{"id":"8329","name":"dfvhh","behaviors":{"b1":"pjjdn","b2":"ooqsh"
>
> {"user_id":"1615","providers":{{"id":"6105","name":"rsfon","behaviors":{"b1":"whlje","b2":"lpjnq"}},{"id":"6828","name":"pnmrb","behaviors":{"b1":"fjpmz","b2":"dxqxk"
>
> {"user_id":"5210","providers":{{"id":"9360","name":"xdylm","behaviors":{"b1":"gcdze","b2":"cndcs"}},{"id":"4812","name":"gxboh","behaviors":{"b1":"qsxao","b2":"ixdzq"
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Loading-in-json-with-spark-sql-tp22044.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL. Cast to Bigint

2015-03-13 Thread Yin Huai
Are you using SQLContext? Right now, the parser in SQLContext is quite
limited in the data type keywords it handles (see here)
and unfortunately "bigint" is not among them yet. We will add
other data types there (https://issues.apache.org/jira/browse/SPARK-6146
is used to track it). Can you try HiveContext for now?
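
A minimal sketch of that workaround; "column" is taken from the original
message and my_table is a placeholder:

// HiveContext's parser accepts BIGINT, unlike the basic SQLContext parser.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hiveContext.sql("SELECT CAST(column AS BIGINT) FROM my_table")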

On Fri, Mar 13, 2015 at 4:48 AM, Masf  wrote:

> Hi.
>
> I have a query in Spark SQL and I can not covert a value to BIGINT:
> CAST(column AS BIGINT) or
> CAST(0 AS BIGINT)
>
> The output is:
> Exception in thread "main" java.lang.RuntimeException: [34.62] failure:
> ``DECIMAL'' expected but identifier BIGINT found
>
> Thanks!!
> Regards.
> Miguel Ángel
>


Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Yin Huai
Hi Nitay,

Can you try using backticks to quote the column name? Like
org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType(
"struct<`int`:bigint>")?

Thanks,

Yin

On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust 
wrote:

> Thanks for reporting.  This was a result of a change to our DDL parser
> that resulted in types becoming reserved words.  I've filled a JIRA and
> will investigate if this is something we can fix.
> https://issues.apache.org/jira/browse/SPARK-6250
>
> On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe  wrote:
>
>> In Spark 1.2 I used to be able to do this:
>>
>> scala>
>> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
>> res30: org.apache.spark.sql.catalyst.types.DataType =
>> StructType(List(StructField(int,LongType,true)))
>>
>> That is, the name of a column can be a keyword like "int". This is no
>> longer the case in 1.3:
>>
>> data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
>> org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8]
>> failure: ``>'' expected but `int' found
>>
>> struct<int:bigint>
>>        ^
>> at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
>> at
>> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
>> at
>> org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)
>>
>> Note HiveTypeHelper is simply an object I load in to expose
>> HiveMetastoreTypes since it was made private. See
>> https://gist.github.com/nitay/460b41ed5fd7608507f5
>> 
>>
>> This is actually a pretty big problem for us as we have a bunch of legacy
>> tables with column names like "timestamp". They work fine in 1.2, but now
>> everything throws in 1.3.
>>
>> Any thoughts?
>>
>> Thanks,
>> - Nitay
>> Founder & CTO
>>
>>
>


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon,

No, you do not need to install Hive to use Spark SQL.

Thanks,

Yin

On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli  wrote:

>  Does Spark-SQL require installation of Hive for it to run correctly or
> not?
>
> I could not tell from this statement:
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
>
> Thank you,
> Edmon
>


Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
@Shahab, based on https://issues.apache.org/jira/browse/HIVE-5472,
current_date was added in Hive *1.2.0 (not 0.12.0)*. In my previous email,
I meant that current_date is in neither Hive 0.12.0 nor Hive 0.13.1 (Spark
SQL currently supports these two Hive versions).

On Tue, Mar 3, 2015 at 8:55 AM, Rohit Rai  wrote:

> The Hive dependency comes from spark-hive.
>
> It does work with Spark 1.1 we will have the 1.2 release later this month.
> On Mar 3, 2015 8:49 AM, "shahab"  wrote:
>
>>
>> Thanks Rohit,
>>
>> I am already using Calliope and quite happy with it, well done ! except
>> the fact that :
>> 1- It seems that it does not support Hive 0.12 or higher, Am i right?
>>  for example you can not use : current_time() UDF, or those new UDFs added
>> in hive 0.12 . Are they supported? Any plan for supporting them?
>> 2-It does not support Spark 1.1 and 1.2. Any plan for new release?
>>
>> best,
>> /Shahab
>>
>> On Tue, Mar 3, 2015 at 5:41 PM, Rohit Rai  wrote:
>>
>>> Hello Shahab,
>>>
>>> I think CassandraAwareHiveContext
>>> 
>>>  in
>>> Calliopee is what you are looking for. Create CAHC instance and you should
>>> be able to run hive functions against the SchemaRDD you create from there.
>>>
>>> Cheers,
>>> Rohit
>>>
>>> *Founder & CEO, **Tuplejump, Inc.*
>>> 
>>> www.tuplejump.com
>>> *The Data Engineering Platform*
>>>
>>> On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao  wrote:
>>>
  The temp table in metastore can not be shared cross SQLContext
 instances, since HiveContext is a sub class of SQLContext (inherits all of
 its functionality), why not using a single HiveContext globally? Is there
 any specific requirement in your case that you need multiple
 SQLContext/HiveContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 9:46 PM

 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



 You are right ,  CassandraAwareSQLContext is subclass of SQL context.



 But I did another experiment, I queried Cassandra
 using CassandraAwareSQLContext, then I registered the "rdd" as a temp table
 , next I tried to query it using HiveContext, but it seems that hive
 context can not see the registered table suing SQL context. Is this a
 normal case?



 best,

 /Shahab





 On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao  wrote:

  Hive UDF are only applicable for HiveContext and its subclass
 instance, is the CassandraAwareSQLContext a direct sub class of
 HiveContext or SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



   val sc: SparkContext = new SparkContext(conf)

   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some
 Calliope Cassandra Spark connector

 val rdd : SchemaRDD  = sqlCassContext.sql("select * from db.profile " )

 rdd.cache

 rdd.registerTempTable("profile")

  rdd.first  //enforce caching

  val q = "select  from_unixtime(floor(createdAt/1000)) from profile
 where sampling_bucket=0 "

  val rdd2 = rdd.sqlContext.sql(q )

  println ("Result: " + rdd2.first)



 And I get the following  errors:

 xception in thread "main"
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling,
 profile, org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None,
 None, false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at sca

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1
(versions that we support). Seems
https://issues.apache.org/jira/browse/HIVE-5472 added it to Hive recently.

On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao  wrote:

>  The temp table in metastore can not be shared cross SQLContext
> instances, since HiveContext is a sub class of SQLContext (inherits all of
> its functionality), why not using a single HiveContext globally? Is there
> any specific requirement in your case that you need multiple
> SQLContext/HiveContext?
>
>
>
> *From:* shahab [mailto:shahab.mok...@gmail.com]
> *Sent:* Tuesday, March 3, 2015 9:46 PM
>
> *To:* Cheng, Hao
> *Cc:* user@spark.apache.org
> *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server
>
>
>
> You are right ,  CassandraAwareSQLContext is subclass of SQL context.
>
>
>
> But I did another experiment, I queried Cassandra
> using CassandraAwareSQLContext, then I registered the "rdd" as a temp table
> , next I tried to query it using HiveContext, but it seems that hive
> context can not see the registered table suing SQL context. Is this a
> normal case?
>
>
>
> best,
>
> /Shahab
>
>
>
>
>
> On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao  wrote:
>
>  Hive UDF are only applicable for HiveContext and its subclass instance,
> is the CassandraAwareSQLContext a direct sub class of HiveContext or
> SQLContext?
>
>
>
> *From:* shahab [mailto:shahab.mok...@gmail.com]
> *Sent:* Tuesday, March 3, 2015 5:10 PM
> *To:* Cheng, Hao
> *Cc:* user@spark.apache.org
> *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server
>
>
>
>   val sc: SparkContext = new SparkContext(conf)
>
>   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some
> Calliope Cassandra Spark connector
>
> val rdd : SchemaRDD  = sqlCassContext.sql("select * from db.profile " )
>
> rdd.cache
>
> rdd.registerTempTable("profile")
>
>  rdd.first  //enforce caching
>
>  val q = "select  from_unixtime(floor(createdAt/1000)) from profile
> where sampling_bucket=0 "
>
>  val rdd2 = rdd.sqlContext.sql(q )
>
>  println ("Result: " + rdd2.first)
>
>
>
> And I get the following  errors:
>
> xception in thread "main"
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:
>
> Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]
>
>  Filter (sampling_bucket#10 = 0)
>
>   Subquery profile
>
>Project
> [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]
>
> CassandraRelation localhost, 9042, 9160, normaldb_sampling, profile,
> org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None, None,
> false, Some(Configuration: core-default.xml, core-site.xml,
> mapred-default.xml, mapred-site.xml)
>
>
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
>
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
>
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)
>
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>
> at
> scala.collection.IndexedSeqOptimized$class

Re: Issues reading in Json file with spark sql

2015-03-02 Thread Yin Huai
Is the string of the above JSON object on a single line? jsonFile requires
that every line contains a complete JSON object or an array of JSON objects.
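
If rewriting the file as one object per line is not an option, a hedged
alternative (using the path and sqlsc from the message below) is to read the
whole file and hand it to jsonRDD:

// Read the entire file as one string, collapse newlines, and parse it as a single JSON record.
val oneRecord = sc.wholeTextFiles("/tmp/try.avsc").map(_._2.replace("\n", " "))
val j = sqlsc.jsonRDD(oneRecord)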

On Mon, Mar 2, 2015 at 11:28 AM, kpeng1  wrote:

> Hi All,
>
> I am currently having issues reading in a json file using spark sql's api.
> Here is what the json file looks like:
> {
>   "namespace": "spacey",
>   "name": "namer",
>   "type": "record",
>   "fields": [
> {"name":"f1","type":["null","string"]},
> {"name":"f2","type":["null","string"]},
> {"name":"f3","type":["null","string"]},
> {"name":"f4","type":["null","string"]},
> {"name":"f5","type":["null","string"]},
> {"name":"f6","type":["null","string"]},
> {"name":"f7","type":["null","string"]},
> {"name":"f8","type":["null","string"]},
> {"name":"f9","type":["null","string"]},
> {"name":"f10","type":["null","string"]},
> {"name":"f11","type":["null","string"]},
> {"name":"f12","type":["null","string"]},
> {"name":"f13","type":["null","string"]},
> {"name":"f14","type":["null","string"]},
> {"name":"f15","type":["null","string"]}
>   ]
> }
>
> This is what I am doing to read in the json file(using spark sql in the
> spark shell on CDH5.3):
>
> val sqlsc = new org.apache.spark.sql.SQLContext(sc)
> val j = sqlsc.jsonFile("/tmp/try.avsc")
>
>
> This is what I am getting as an error:
>
> 15/03/02 11:23:45 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 12,
> 10.0.2.15): scala.MatchError: namespace (of class java.lang.String)
> at
>
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:305)
> at
>
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:303)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
>
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
> at
> scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:853)
> at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:851)
> at
> org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1350)
> at
> org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1350)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> 15/03/02 11:23:45 INFO TaskSetManager: Starting task 0.1 in stage 3.0 (TID
> 14, 10.0.2.15, ANY, 1308 bytes)
> 15/03/02 11:23:45 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
> 13) in 128 ms on 10.0.2.15 (1/2)
> 15/03/02 11:23:45 INFO TaskSetManager: Lost task 0.1 in stage 3.0 (TID 14)
> on executor 10.0.2.15: scala.MatchError (namespace (of class
> java.lang.String)) [duplicate 1]
> 15/03/02 11:23:45 INFO TaskSetManager: Starting task 0.2 in stage 3.0 (TID
> 15, 10.0.2.15, ANY, 1308 bytes)
> 15/03/02 11:23:45 INFO TaskSetManager: Lost task 0.2 in stage 3.0 (TID 15)
> on executor 10.0.2.15: scala.MatchError (namespace (of class
> java.lang.String)) [duplicate 2]
> 15/03/02 11:23:45 INFO TaskSetManager: Starting task 0.3 in stage 3.0 (TID
> 16, 10.0.2.15, ANY, 1308 bytes)
> 15/03/02 11:23:45 INFO TaskSetManager: Lost task 0.3 in stage 3.0 (TID 16)
> on executor 10.0.2.15: scala.MatchError (namespace (of class
> java.lang.String)) [duplicate 3]
> 15/03/02 11:23:45 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times;
> aborting job
> 15/03/02 11:23:45 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks
> have all completed, from pool
> 15/03/02 11:23:45 INFO TaskSchedulerImpl: Cancelling stage 3
> 15/03/02 11:23:45 INFO DAGScheduler: Job 3 failed: reduce at
> JsonRDD.scala:57, took 0.210707 s
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in
> stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0
> (TID 16, 10.0.2.15): scala.MatchError: namespace (of class
> java.lang.String)
> at
>
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:305)
> at
>
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:303)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$cl

Re: [SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4296. It has been
fixed in 1.2.1 and 1.3.

On Thu, Feb 26, 2015 at 1:22 PM, Yana Kadiyska 
wrote:

> Can someone confirm if they can run UDFs in group by in spark1.2?
>
> I have two builds running -- one from a custom build from early December
> (commit 4259ca8dd12) which works fine, and Spark1.2-RC2.
>
> On the latter I get:
>
>  jdbc:hive2://XXX.208:10001> select 
> from_unixtime(epoch,'-MM-dd-HH'),count(*) count
> . . . . . . . . . . . . . . . . . .> from tbl
> . . . . . . . . . . . . . . . . . .> group by 
> from_unixtime(epoch,'-MM-dd-HH');
> Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Expression not in GROUP BY: 
> HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,-MM-dd-HH)
>  AS _c0#1004, tree:
> Aggregate 
> [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,-MM-dd-HH)],
>  
> [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,-MM-dd-HH)
>  AS _c0#1004,COUNT(1) AS count#1003L]
>  MetastoreRelation default, tbl, None (state=,code=0)
>
> ​
>
> This worked fine on my older build. I don't see a JIRA on this but maybe
> I'm not looking right. Can someone please advise?
>
>
>
>


Re: SQLContext.applySchema strictness

2015-02-13 Thread Yin Huai
Hi Justin,

It is expected. We do not check whether the provided schema matches the
rows, since such a check would require scanning all of the rows.
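
A hedged sketch of what that implies for the code quoted below: any
validation is up to the caller (or is deferred until a row is read), for
example:

// applySchema does not scan the RDD, so mismatches only surface when a row is accessed.
// One option is to filter out rows that do not match the declared type up front.
val validData = myData.filter(r => r(0).isInstanceOf[Boolean])
val checkedData = sqlContext.applySchema(validData, struct)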

Thanks,

Yin

On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony 
wrote:

> Per the documentation:
>
>   It is important to make sure that the structure of every Row of the
> provided RDD matches the provided schema. Otherwise, there will be runtime
> exception.
>
> However, it appears that this is not being enforced.
>
> import org.apache.spark.sql._
> val sqlContext = new SqlContext(sc)
> val struct = StructType(List(StructField("test", BooleanType, true)))
> val myData = sc.parallelize(List(Row(0), Row(true), Row("stuff")))
> val schemaData = sqlContext.applySchema(myData, struct) //No error
> schemaData.collect()(0).getBoolean(0) //Only now will I receive an error
>
> Is this expected or a bug?
>
> Thanks,
> Justin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-applySchema-strictness-tp21650.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Yin Huai
Regarding backticks: Right. You need backticks to quote the column name
timestamp because timestamp is a reserved keyword in our parser.

On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani 
wrote:

> actually i tried in spark shell , got same error and then for some reason
> i tried to back tick the "timestamp" and it worked.
>  val result = sqlContext.sql("select toSeconds(`timestamp`) as t,
> count(rid) as qps from blah group by toSeconds(`timestamp`),qi.clientName")
>
> so, it seems sql context is supporting UDF.
>
>
>
> On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust 
> wrote:
>
>> The simple SQL parser doesn't yet support UDFs.  Try using a HiveContext.
>>
>> On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani <
>> mohnish.kodn...@gmail.com> wrote:
>>
>>> Hi,
>>> I am trying a very simple registerFunction and it is giving me errors.
>>>
>>> I have a parquet file which I register as temp table.
>>> Then I define a UDF.
>>>
>>> def toSeconds(timestamp: Long): Long = timestamp/10
>>>
>>> sqlContext.registerFunction("toSeconds", toSeconds _)
>>>
>>> val result = sqlContext.sql("select toSeconds(timestamp) from blah");
>>> I get the following error.
>>> java.lang.RuntimeException: [1.18] failure: ``)'' expected but
>>> `timestamp' found
>>>
>>> select toSeconds(timestamp) from blah
>>>
>>> My end goal is as follows:
>>> We have log file with timestamps in microseconds and I would like to
>>> group by entries with second level precision, so eventually I want to run
>>> the query
>>> select toSeconds(timestamp) as t, count(x) from table group by t,x
>>>
>>>
>>>
>>>
>>
>


Re: Can we execute "create table" and "load data" commands against Hive inside HiveContext?

2015-02-10 Thread Yin Huai
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
was introduced in Hive 0.14 and Spark SQL only supports Hive 0.12 and
0.13.1. Can you change the setting of hive.security.authorization.manager
to something accepted by Hive 0.12 or 0.13.1?
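
A hedged sketch of one way to do that, assuming a HiveContext named
hiveContext; the provider class below is the classic pre-0.14 value and is my
assumption, not something confirmed in this thread (it can equally be set in
hive-site.xml):

// Point hive.security.authorization.manager at a class that exists in Hive 0.12/0.13.
hiveContext.setConf("hive.security.authorization.manager",
  "org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider")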

On Thu, Feb 5, 2015 at 11:40 PM, guxiaobo1982  wrote:

> Hi,
>
> I am playing with the following example code:
>
> public class SparkTest {
>
> public static void main(String[] args){
>
>  String appName= "This is a test application";
>
>  String master="spark://lix1.bh.com:7077";
>
>   SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
>
>  JavaSparkContext sc = new JavaSparkContext(conf);
>
>   JavaHiveContext sqlCtx = new
> org.apache.spark.sql.hive.api.java.JavaHiveContext(sc);
>
>  //sqlCtx.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
>
> //sqlCtx.sql("LOAD DATA LOCAL INPATH
> '/opt/spark/examples/src/main/resources/kv1.txt' INTO TABLE src");
>
>  // Queries are expressed in HiveQL.
>
>  List rows = sqlCtx.sql("FROM src SELECT key, value").collect();
>
>  //List rows = sqlCtx.sql("show tables").collect();
>
>   System.out.print("I got " + rows.size() + " rows \r\n");
>
>  sc.close();
>
> }}
>
> With the create table and load data commands commented out, the query
> command can be executed successfully, but I come to ClassNotFoundExceptions
> if these two commands are executed inside HiveContext, even with different
> error messages,
>
> The create table command will cause the following:
>
>
>
>
> Exception in thread "main"
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Hive
> Internal Error:
> java.lang.ClassNotFoundException(org.apache.hadoop.hive.ql.hooks.ATSHook)
>
> at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
>
> at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
>
> at
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(
> NativeCommand.scala:35)
>
> at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(
> NativeCommand.scala:35)
>
> at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
>
> at org.apache.spark.sql.hive.execution.NativeCommand.execute(
> NativeCommand.scala:30)
>
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(
> SQLContext.scala:425)
>
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(
> SQLContext.scala:425)
>
> at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
>
> at org.apache.spark.sql.api.java.JavaSchemaRDD.(
> JavaSchemaRDD.scala:42)
>
> at org.apache.spark.sql.hive.api.java.JavaHiveContext.sql(
> JavaHiveContext.scala:37)
>
> at com.blackhorse.SparkTest.main(SparkTest.java:24)
>
> [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook
> called
>
> [delete Spark local dirs] DEBUG org.apache.spark.storage.DiskBlockManager
> - Shutdown hook called
>
> The load data command will cause the following:
>
>
>
> Exception in thread "main"
> org.apache.spark.sql.execution.QueryExecutionException: FAILED:
> RuntimeException org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassNotFoundException:
> org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
>
> at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
>
> at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
>
> at
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>
> at
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>
> at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
>
> at
> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>
> at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
>
> at
> org.apache.spark.sql.api.java.JavaSchemaRDD.(JavaSchemaRDD.scala:42)
>
> at
> org.apache.spark.sql.hive.api.java.JavaHiveContext.sql(JavaHiveContext.scala:37)
>
> at com.blackhorse.SparkTest.main(SparkTest.java:25)
>
> [delete Spark local dirs] DEBUG org.apache.spark.storage.DiskBlockManager
> - Shutdown hook called
>
> [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook
> called
>
>
>


Re: Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread Yin Huai
Can you try using backticks to quote the field name? Like `f:price`.
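
A sketch of the query from the message below with the field name backticked:

val productsRdd = sqlContext.sql(
  "SELECT Locales.Invariant.Metadata.item.`f:price` FROM products LIMIT 10")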

On Tue, Feb 10, 2015 at 5:47 AM, presence2001 
wrote:

> Hi list,
>
> I have some data with a field name of f:price (it's actually part of a JSON
> structure loaded from ElasticSearch via elasticsearch-hadoop connector, but
> I don't think that's significant here). I'm struggling to figure out how to
> express that in a Spark SQL SELECT statement without generating an error
> (and haven't been able to find any similar examples in the documentation).
>
> val productsRdd = sqlContext.sql("SELECT
> Locales.Invariant.Metadata.item.f:price FROM products LIMIT 10")
>
> gives me the following error...
>
> java.lang.RuntimeException: [1.41] failure: ``UNION'' expected but `:'
> found
>
> Changing the column name is one option, but I have other systems depending
> on this right now so it's not a trivial exercise. :(
>
> I'm using Spark 1.2.
>
> Thanks in advance for any advice / help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Column-name-including-a-colon-in-a-SELECT-clause-tp21576.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Yin Huai
Hello Michael,

In Spark SQL, we have our internal concepts of Output Partitioning
(representing the partitioning scheme of an operator's output) and Required
Child Distribution (representing the requirement of input data distribution
of an operator) for a physical operator. Let's say we have two operators,
parent and child, and the parent takes the output of the child as its
input. At the end of query planning process, whenever the Output
Partitioning of the child does not satisfy the Required Child Distribution
of the parent, we will add an Exchange operator between the parent and
child to shuffle the data. Right now, we do not record the partitioning
scheme of an input table. So, I think even if you use partitionBy (or
DISTRIBUTE BY in SQL) to prepare your data, you will still see the Exchange
operator and your GROUP BY operation will be executed in a new stage (after
the Exchange).

Making Spark SQL aware of the partitioning scheme of input tables is a
useful optimization. I just created
https://issues.apache.org/jira/browse/SPARK-5354 to track it.
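
A hedged sketch of how to observe this with the query quoted below (Spark
1.2-era API, assuming Orders is registered as a table):

// The physical plan shows an Exchange (shuffle) feeding the aggregation,
// even if the input was pre-partitioned by CustomerCode.
val grouped = sqlContext.sql(
  "SELECT CustomerCode, SUM(Cost) FROM Orders GROUP BY CustomerCode")
println(grouped.queryExecution.executedPlan)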

Thanks,

Yin



On Wed, Jan 21, 2015 at 8:43 AM, Michael Davies <
michael.belldav...@gmail.com> wrote:

> Hi Cheng,
>
> Are you saying that by setting up the lineage schemaRdd.keyBy(_.getString(
> 1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)
> then Spark SQL will know that an SQL “group by” on Customer Code will not
> have to shuffle?
>
> But the prepared will have already shuffled so we pay an upfront cost for
> future groupings (assuming we cache I suppose)
>
> Mick
>
> On 20 Jan 2015, at 20:44, Cheng Lian  wrote:
>
>  First of all, even if the underlying dataset is partitioned as expected,
> a shuffle can’t be avoided. Because Spark SQL knows nothing about the
> underlying data distribution. However, this does reduce network IO.
>
> You can prepare your data like this (say CustomerCode is a string field
> with ordinal 1):
>
> val schemaRdd = sql(...)val schema = schemaRdd.schemaval prepared = 
> schemaRdd.keyBy(_.getString(1)).partitionBy(new 
> HashPartitioner(n)).values.applySchema(schema)
>
> n should be equal to spark.sql.shuffle.partitions.
>
> Cheng
>
> On 1/19/15 7:44 AM, Mick Davies wrote:
>
>
>  Is it possible to use a HashPartioner or something similar to distribute a
> SchemaRDDs data by the hash of a particular column or set of columns.
>
> Having done this I would then hope that GROUP BY could avoid shuffle
>
> E.g. set up a HashPartioner on CustomerCode field so that
>
> SELECT CustomerCode, SUM(Cost)
> FROM Orders
> GROUP BY CustomerCode
>
> would not need to shuffle.
>
> Cheers
> Mick
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Using-HashPartitioner-to-distribute-by-column-tp21237.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
> ​
>
>
>


Re: Hive UDAF percentile_approx says "This UDAF does not support the deprecated getEvaluator() method."

2015-01-13 Thread Yin Huai
Yeah, it's a bug. It has been fixed by
https://issues.apache.org/jira/browse/SPARK-3891 in master.

On Tue, Jan 13, 2015 at 2:41 PM, Ted Yu  wrote:

> Looking at the source code for AbstractGenericUDAFResolver, the following
> (non-deprecated) method should be called:
>
>   public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info)
>
> It is called by hiveUdfs.scala (master branch):
>
> val parameterInfo = new
> SimpleGenericUDAFParameterInfo(inspectors.toArray, false, false)
> resolver.getEvaluator(parameterInfo)
>
> FYI
>
> On Tue, Jan 13, 2015 at 1:51 PM, Jianshi Huang 
> wrote:
>
>> Hi,
>>
>> The following SQL query
>>
>> select percentile_approx(variables.var1, 0.95) p95
>> from model
>>
>> will throw
>>
>> ERROR SparkSqlInterpreter: Error
>> org.apache.hadoop.hive.ql.parse.SemanticException: This UDAF does not
>> support the deprecated getEvaluator() method.
>> at
>> org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:53)
>> at
>> org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector$lzycompute(hiveUdfs.scala:196)
>> at
>> org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector(hiveUdfs.scala:195)
>> at
>> org.apache.spark.sql.hive.HiveGenericUdaf.dataType(hiveUdfs.scala:203)
>> at
>> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:105)
>> at
>> org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143)
>> at
>> org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143)
>>
>> I'm using latest branch-1.2
>>
>> I found in PR that percentile and percentile_approx are supported. A bug?
>>
>> Thanks,
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


Re: How to convert RDD to JSON?

2014-12-08 Thread Yin Huai
If you are using Spark SQL in 1.2, you can use toJson to convert a
SchemaRDD to an RDD[String] that contains one JSON object per string value.
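
A minimal sketch, assuming Spark 1.2, a SQLContext named sqlContext, and a
case class standing in for the four columns; the method is written here as
toJSON, which is my reading of the 1.2 API and worth double-checking:

import sqlContext.createSchemaRDD
case class Record(c1: String, c2: String, c3: String, c4: String)
val rows = sc.parallelize(Seq(Record("a", "b", "c", "d")))  // stand-in for the real 4-column RDD
val jsonStrings = rows.toJSON                               // RDD[String], one JSON object per row
jsonStrings.collect().foreach(println)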

Thanks,

Yin

On Mon, Dec 8, 2014 at 11:52 PM, YaoPau  wrote:

> Pretty straightforward: Using Scala, I have an RDD that represents a table
> with four columns.  What is the recommended way to convert the entire RDD
> to
> one JSON object?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-RDD-to-JSON-tp20585.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Yin Huai
Hello Jianshi,

You meant you want to convert a Map to a Struct, right? We can extract some
useful functions from JsonRDD.scala, so others can access them.

Thanks,

Yin

On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang 
wrote:

> I checked the source code for inferSchema. Looks like this is exactly what
> I want:
>
>   val allKeys = rdd.map(allKeysWithValueTypes).reduce(_ ++ _)
>
> Then I can do createSchema(allKeys).
>
> Jianshi
>
> On Sun, Dec 7, 2014 at 2:50 PM, Jianshi Huang 
> wrote:
>
>> Hmm..
>>
>> I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782
>>
>> Jianshi
>>
>> On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang 
>> wrote:
>>
>>> Hi,
>>>
>>> What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD?
>>>
>>> I'm currently converting each Map to a JSON String and do
>>> JsonRDD.inferSchema.
>>>
>>> How about adding inferSchema support to Map[String, Any] directly? It
>>> would be very useful.
>>>
>>> Thanks,
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>


Re: can't get smallint field from hive on spark

2014-11-26 Thread Yin Huai
For "hive on spark", did you mean the thrift server of Spark SQL or
https://issues.apache.org/jira/browse/HIVE-7292? If you meant the latter
one, I think Hive's mailing list will be a good place to ask (see
https://hive.apache.org/mailing_lists.html).

Thanks,

Yin

On Wed, Nov 26, 2014 at 10:49 PM, 诺铁  wrote:

> thank you very much.
>
> On Thu, Nov 27, 2014 at 11:30 AM, Michael Armbrust  > wrote:
>
>> This has been fixed in Spark 1.1.1 and Spark 1.2
>>
>> https://issues.apache.org/jira/browse/SPARK-3704
>>
>> On Wed, Nov 26, 2014 at 7:10 PM, 诺铁  wrote:
>>
>>> hi,
>>>
>>> don't know whether this question should be asked here, if not, please
>>> point me out, thanks.
>>>
>>> we are currently using hive on spark, when reading a small int field, it
>>> reports error:
>>> Cannot get field 'i16Val' because union is currently set to i32Val
>>>
>>> I googled and find only source code of
>>> TColumnValue.java
>>> 
>>>   where the exception is thrown
>>>
>>> don't know where to go, please help
>>>
>>
>>
>


Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Yin Huai
Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive
table. Hive does not have the notion of "containsNull" for array values.
So, for a Hive table, containsNull will always be true for an array, and
we should ignore this field for Hive. This issue has been fixed by
https://issues.apache.org/jira/browse/SPARK-4245, which will be released
with 1.2.

Thanks,

Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan 
wrote:

> After playing around with this a little more, I discovered that:
>
> 1. If test.json contains something like {"values":[null,1,2,3]}, the
> schema auto-determined by SchemaRDD.jsonFile() will have "element: integer
> (containsNull = true)", and then
> SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
> makes sense but doesn't really help).
> 2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
> StructType(Seq(StructField("values", ArrayType(IntegerType, true),
> true), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
> work, though as I mentioned before, this is less than ideal.
>
> Why don't saveAsTable/insertInto work when the containsNull properties
> don't match?  I can understand how inserting data with containsNull=true
> into a column where containsNull=false might fail, but I think the other
> way around (which is the case here) should work.
>
> ~ Jonathan
>
>
> On 11/26/14, 5:23 PM, "Kelly, Jonathan"  wrote:
>
> >I've noticed some strange behavior when I try to use
> >SchemaRDD.saveAsTable() with a SchemaRDD that I¹ve loaded from a JSON file
> >that contains elements with nested arrays.  For example, with a file
> >test.json that contains the single line:
> >
> >   {"values":[1,2,3]}
> >
> >and with code like the following:
> >
> >scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> >scala> val test = sqlContext.jsonFile("test.json")
> >scala> test.saveAsTable("test")
> >
> >it creates the table but fails when inserting the data into it.  Here¹s
> >the exception:
> >
> >scala.MatchError: ArrayType(IntegerType,true) (of class
> >org.apache.spark.sql.catalyst.types.ArrayType)
> >   at
> >org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:
> >2
> >47)
> >   at
> org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> >   at
> org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> >   at
> >org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scal
> >a
> >:84)
> >   at
> >org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.app
> >l
> >y(Projection.scala:66)
> >   at
> >org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.app
> >l
> >y(Projection.scala:50)
> >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >   at
> >org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
> $apache$spark$s
> >q
> >l$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.sc
> >a
> >la:149)
> >   at
> >org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiv
> >e
> >File$1.apply(InsertIntoHiveTable.scala:158)
> >   at
> >org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiv
> >e
> >File$1.apply(InsertIntoHiveTable.scala:158)
> >   at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> >   at org.apache.spark.scheduler.Task.run(Task.scala:54)
> >   at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> >   at
> >java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:
> >1
> >145)
> >   at
> >java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java
> >:
> >615)
> >   at java.lang.Thread.run(Thread.java:745)
> >
> >I'm guessing that this is due to the slight difference in the schemas of
> >these tables:
> >
> >scala> test.printSchema
> >root
> > |-- values: array (nullable = true)
> > ||-- element: integer (containsNull = false)
> >
> >
> >scala> sqlContext.table("test").printSchema
> >root
> > |-- values: array (nullable = true)
> > ||-- element: integer (containsNull = true)
> >
> >If I reload the file using the schema that was created for the Hive table
> >then try inserting the data into the table, it works:
> >
> >scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
> >sqlContext.table("test").schema).insertInto("test")
> >scala> sqlContext.sql("select * from test").collect().foreach(println)
> >[ArrayBuffer(1, 2, 3)]
> >
> >Does this mean that there is a bug with how the schema is being
> >automatically determined when you use HiveContext.jsonFile() for JSON
> >files that contain nested arrays?  (i.e., should containsNull be true for
> >the array elements?)  Or is there a bug with how the Hive table is created
> >from the SchemaRDD?  (i.e., should containsNull in fact be 
