Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hello Jacek,

Actually, Reynold is still the release manager and I am just sending this
message for him :) Sorry. I should have made it clear in my original email.

Thanks,

Yin

On Thu, Dec 29, 2016 at 10:58 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Yan,
>
> I was surprised at first when I noticed rxin stepped back and a
> new release manager stepped in. Congrats on your first ANNOUNCE!
>
> I can only expect even more great stuff coming into Spark from the dev
> team now that Reynold has spared some time.
>
> Can't wait to read the changes...
>
> Jacek
>
> On 29 Dec 2016 5:03 p.m., "Yin Huai" <yh...@databricks.com> wrote:
>
>> Hi all,
>>
>> Apache Spark 2.1.0 is the second release of the Spark 2.x line. This release
>> makes significant strides in the production readiness of Structured
>> Streaming, with added support for event time watermarks
>> <https://spark.apache.org/docs/2.1.0/structured-streaming-programming-guide.html#handling-late-data-and-watermarking>
>> and Kafka 0.10 support
>> <https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html>.
>> In addition, this release focuses more on usability, stability, and polish,
>> resolving over 1200 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 2.1.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes: https://spark.apache.org/releases/spark-release-2-1-0.html
>>
>> (note: If you see any issues with the release notes, webpage or published
>> artifacts, please contact me directly off-list)
>>
>>
>>


Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-29 Thread Yin Huai
Hello,

We spent some time preparing the artifacts and the changes to the website
(including the release notes). I just sent out the announcement. 2.1.0 is
officially released.

Thanks,

Yin

On Wed, Dec 28, 2016 at 12:42 PM, Justin Miller <
justin.mil...@protectwise.com> wrote:

> Interesting, because a bug that seemed to be fixed in 2.1.0-SNAPSHOT
> doesn't appear to be fixed in 2.1.0 stable (it centered around a
> null-pointer exception during code gen). It seems to be fixed in
> 2.1.1-SNAPSHOT, but I can try stable again.
>
> On Dec 28, 2016, at 1:38 PM, Mark Hamstra  wrote:
>
> A SNAPSHOT build is not a stable artifact, but rather floats to the top of
> commits that are intended for the next release.  So, 2.1.1-SNAPSHOT comes
> after the 2.1.0 release and contains any code at the time that the artifact
> was built that was committed to the branch-2.1 maintenance branch and is,
> therefore, intended for the eventual 2.1.1 maintenance release.  Once a
> release is tagged and stable artifacts for it can be built, there is no
> purpose for a SNAPSHOT of that release -- e.g. there is no longer any
> purpose for a 2.1.0-SNAPSHOT release; if you want 2.1.0, then you should be
> using stable artifacts now, not SNAPSHOTs.
>
> The existence of a SNAPSHOT doesn't imply anything about the release date
> of the associated finished version.  Rather, it only indicates a name that
> is attached to all of the code that is currently intended for the
> associated release number.
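As a concrete illustration, an sbt sketch (the Spark coordinates are the usual
ones and the resolver is the Apache snapshots repository linked further down;
treat this as an assumption-laden example, not part of Mark's message):

  // a release is a fixed, stable artifact resolved from Maven Central
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"

  // a SNAPSHOT floats with the branch and needs the Apache snapshots resolver
  resolvers += "apache-snapshots" at "https://repository.apache.org/content/groups/snapshots/"
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1-SNAPSHOT" % "provided"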
>
> On Wed, Dec 28, 2016 at 3:09 PM, Justin Miller <
> justin.mil...@protectwise.com> wrote:
>
>> It looks like the jars for 2.1.0-SNAPSHOT are gone?
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.1.0-SNAPSHOT/
>>
>> Also:
>>
>> 2.1.0-SNAPSHOT/   Fri Dec 23 16:31:42 UTC 2016
>> 2.1.1-SNAPSHOT/   Wed Dec 28 20:01:10 UTC 2016
>> 2.2.0-SNAPSHOT/   Wed Dec 28 19:12:38 UTC 2016
>>
>> What's with 2.1.1-SNAPSHOT? Is that version about to be released as well?
>>
>> Thanks!
>> Justin
>>
>> On Dec 28, 2016, at 12:53 PM, Mark Hamstra wrote:
>>
>> The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0
>>
>> On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers  wrote:
>>
>>> seems like the artifacts are on maven central but the website is not yet
>>> updated.
>>>
>>> strangely the tag v2.1.0 is not yet available on github. i assume it's
>>> equal to v2.1.0-rc5
>>>
>>> On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller <
>>> justin.mil...@protectwise.com> wrote:
>>>
 I'm curious about this as well. Seems like the vote passed.

 > On Dec 23, 2016, at 2:00 AM, Aseem Bansal wrote:


 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>>
>
>


[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all,

Apache Spark 2.1.0 is the second release of the Spark 2.x line. This release
makes significant strides in the production readiness of Structured
Streaming, with added support for event time watermarks
<https://spark.apache.org/docs/2.1.0/structured-streaming-programming-guide.html#handling-late-data-and-watermarking>
and Kafka 0.10 support
<https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html>.
In addition, this release focuses more on usability, stability, and polish,
resolving over 1200 tickets.
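A minimal sketch of the new event-time watermark API (Scala, Spark 2.1; the
socket source, host/port, and column names below are illustrative assumptions,
not part of the release notes):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.window

  val spark = SparkSession.builder().appName("watermark-sketch").getOrCreate()
  import spark.implicits._

  // A streaming DataFrame with an event-time column, read from a socket source
  // purely for illustration (includeTimestamp adds an arrival timestamp column).
  val events = spark.readStream
    .format("socket")
    .option("host", "localhost").option("port", "9999")
    .option("includeTimestamp", true)
    .load()
    .toDF("word", "eventTime")

  // Tolerate data arriving up to 10 minutes late, count per 5-minute window.
  val counts = events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(window($"eventTime", "5 minutes"), $"word")
    .count()

  // In append mode a window's count is emitted once the watermark passes its end.
  val query = counts.writeStream.outputMode("append").format("console").start()
  query.awaitTermination()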

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 2.1.0, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-1-0.html

(note: If you see any issues with the release notes, webpage or published
artifacts, please contact me directly off-list)


Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Yin Huai
Hello Michael,

Thank you for reporting this issue. It will be fixed by
https://github.com/apache/spark/pull/16080.

Thanks,

Yin

On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao  wrote:

> Hi!
>
> Do you have real HIVE installation?
> Have you built Spark 2.1 & Spark 2.0 with HIVE support ( -Phive
> -Phive-thriftserver ) ?
>
> It seems that you use Spark's "default" Hive 1.2.1. Your metadata is
> stored in a local Derby DB which is visible to that particular Spark
> installation but not to the others.
>
> On Wed, Nov 30, 2016 at 4:51 AM, Michael Allman 
> wrote:
>
>> This is not an issue with all tables created in Spark 2.1, though I'm not
>> sure why some work and some do not. I have found that a table created as
>> such
>>
>> sql("create table test stored as parquet as select 1")
>>
>> in Spark 2.1 cannot be read in previous versions of Spark.
>>
>> Michael
>>
>>
>> > On Nov 29, 2016, at 5:15 PM, Michael Allman 
>> wrote:
>> >
>> > Hello,
>> >
>> > When I try to read from a Hive table created by Spark 2.1 in Spark 2.0
>> or earlier, I get an error:
>> >
>> > java.lang.ClassNotFoundException: Failed to load class for data
>> source: hive.
>> >
>> > Is there a way to get previous versions of Spark to read tables written
>> with Spark 2.1?
>> >
>> > Cheers,
>> >
>> > Michael
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
Hi, we made the change because the partitioning discovery logic was too
flexible and it introduced problems that were very confusing to users. To
make your case work, we have introduced a new data source option called
basePath. You can use

DataFrame df = hiveContext.read().format("orc").option("basePath",
  "path/to/table/").load("path/to/table/entity=xyz")

So, the partitioning discovery logic will understand that the base path is
path/to/table/ and your dataframe will have the column "entity".
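In the Scala API, the equivalent sketch (same placeholder paths) and a quick
way to confirm the partition column survives:

  val df = hiveContext.read.format("orc")
    .option("basePath", "path/to/table/")
    .load("path/to/table/entity=xyz")
  df.printSchema()  // the schema now includes the "entity" partition column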

You can find the doc at the end of partitioning discovery section of the
sql programming guide (
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
).

Thanks,

Yin

On Thu, Jan 7, 2016 at 7:34 AM, unk1102  wrote:

> Hi from Spark 1.6 onwards as per this  doc
> <
> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
> >
> We can't add specific Hive partitions to a DataFrame.
>
> spark 1.5 the following used to work and the following dataframe will have
> entity column
>
> DataFrame df =
> hiveContext.read().format("orc").load("path/to/table/entity=xyz")
>
> But in Spark 1.6 above does not work and I have to give base path like the
> following but it does not contain entity column which I want in DataFrame
>
> DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
>
> How do I load a specific Hive partition into a DataFrame? What was the reason
> behind removing this feature? It was efficient; now with Spark 1.6 the above
> code loads all partitions, and if I filter for specific partitions it is not
> efficient: it hits memory and throws GC errors because thousands of partitions
> get loaded into memory instead of just the specific one. Please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-specific-Hive-partition-in-DataFrame-Spark-1-6-tp25904.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
No problem! Glad it helped!

On Thu, Jan 7, 2016 at 12:05 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Yin, thanks much your answer solved my problem. Really appreciate it!
>
> Regards
>
>
> On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi, we made the change because the partitioning discovery logic was too
>> flexible and it introduced problems that were very confusing to users. To
>> make your case work, we have introduced a new data source option called
>> basePath. You can use
>>
>> DataFrame df = hiveContext.read().format("orc").option("basePath",
>>   "path/to/table/").load("path/to/table/entity=xyz")
>>
>> So, the partitioning discovery logic will understand that the base path
>> is path/to/table/ and your dataframe will have the column "entity".
>>
>> You can find the doc at the end of partitioning discovery section of the
>> sql programming guide (
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>> ).
>>
>> Thanks,
>>
>> Yin
>>
>> On Thu, Jan 7, 2016 at 7:34 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>>
>>> Hi from Spark 1.6 onwards as per this  doc
>>> <
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>>> >
>>> We can't add specific Hive partitions to a DataFrame.
>>>
>>> spark 1.5 the following used to work and the following dataframe will
>>> have
>>> entity column
>>>
>>> DataFrame df =
>>> hiveContext.read().format("orc").load("path/to/table/entity=xyz")
>>>
>>> But in Spark 1.6 above does not work and I have to give base path like
>>> the
>>> following but it does not contain entity column which I want in DataFrame
>>>
>>> DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
>>>
>>> How do I load a specific Hive partition into a DataFrame? What was the
>>> reason behind removing this feature? It was efficient; now with Spark 1.6
>>> the above code loads all partitions, and if I filter for specific
>>> partitions it is not efficient: it hits memory and throws GC errors because
>>> thousands of partitions get loaded into memory instead of just the specific
>>> one. Please guide.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-specific-Hive-partition-in-DataFrame-Spark-1-6-tp25904.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Yin Huai
Hi Eran,

Can you try 1.6? With the change in
https://github.com/apache/spark/pull/10288, the JSON data source will no
longer throw a runtime exception when it hits a record that it cannot parse.
Instead, it will put the entire record into the "_corrupt_record" column.
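For example, a minimal sketch against 1.6 (the file path is illustrative, and
the _corrupt_record column only exists when at least one record failed to
parse):

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext()           // or reuse the existing sc in spark-shell
  val sqlContext = new SQLContext(sc)

  val df = sqlContext.read.json("/path/to/sample2.json")

  // Unparseable records land, whole, in _corrupt_record; valid rows have it null.
  val corrupt = df.filter(df("_corrupt_record").isNotNull)
  val clean   = df.filter(df("_corrupt_record").isNull).drop("_corrupt_record")
  println(s"corrupt: ${corrupt.count()}, clean: ${clean.count()}")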

Thanks,

Yin

On Sun, Dec 20, 2015 at 9:37 AM, Eran Witkon  wrote:

> Thanks for this!
> This was the problem...
>
> On Sun, 20 Dec 2015 at 18:49 Chris Fregly  wrote:
>
>> hey Eran, I run into this all the time with Json.
>>
>> the problem is likely that your Json is "too pretty" and extends beyond
>> a single line, which trips up the Json reader.
>>
>> my solution is usually to de-pretty the Json - either manually or through
>> an ETL step - by stripping all white space before pointing my
>> DataFrame/JSON reader at the file.
>>
>> this tool is handy for one-off scenarios:  http://jsonviewer.stack.hu
>>
>> for streaming use cases, you'll want to have a light de-pretty ETL step
>> either within the Spark Streaming job after ingestion - or upstream using
>> something like a Flume interceptor, NiFi Processor (I love NiFi), or Kafka
>> transformation assuming those exist by now.
>>
>> a similar problem exists for XML, btw.  there's lots of wonky workarounds
>> for this that use MapPartitions and all kinds of craziness.  the best
>> option, in my opinion, is to just ETL/flatten the data to make the
>> DataFrame reader happy.
>>
>> On Dec 19, 2015, at 4:55 PM, Eran Witkon  wrote:
>>
>> Hi,
>> I tried the following code in spark-shell on spark1.5.2:
>>
>> *val df =
>> sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")*
>> *df.count()*
>>
>> 15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size =
>> 67108864 bytes, TID = 3
>> 15/12/19 23:49:40 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID
>> 3)
>> java.lang.RuntimeException: Failed to parse a value for data type
>> StructType() (current token: VALUE_STRING).
>> at scala.sys.package$.error(package.scala:27)
>> at
>> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:172)
>> at
>> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:251)
>> at
>> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:246)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:365)
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>> at
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>>
>> Am I am doing something wrong?
>> Eran
>>
>>


Re: DataFrame equality does not working in 1.5.1

2015-11-05 Thread Yin Huai
Can you attach the result of eventDF.filter($"entityType" ===
"user").select("entityId").distinct.explain(true)?

Thanks,

Yin

On Thu, Nov 5, 2015 at 1:12 AM, 千成徳  wrote:

> Hi All,
>
> I have data frame like this.
>
> The equality expression is not working in 1.5.1 but works as expected in 1.4.0.
> What is the difference?
>
> scala> eventDF.printSchema()
> root
>  |-- id: string (nullable = true)
>  |-- event: string (nullable = true)
>  |-- entityType: string (nullable = true)
>  |-- entityId: string (nullable = true)
>  |-- targetEntityType: string (nullable = true)
>  |-- targetEntityId: string (nullable = true)
>  |-- properties: string (nullable = true)
>
> scala> eventDF.groupBy("entityType").agg(countDistinct("entityId")).show
> +----------+------------------------+
> |entityType|COUNT(DISTINCT entityId)|
> +----------+------------------------+
> |   ib_user|                    4751|
> |      user|                    2091|
> +----------+------------------------+
>
>
> - not works ( Bug ? )
> scala> eventDF.filter($"entityType" ===
> "user").select("entityId").distinct.count
> res151: Long = 1219
>
> scala> eventDF.filter(eventDF("entityType") ===
> "user").select("entityId").distinct.count
> res153: Long = 1219
>
> scala> eventDF.filter($"entityType" equalTo
> "user").select("entityId").distinct.count
> res149: Long = 1219
>
> - works as expected
> scala> eventDF.map{ e => (e.getAs[String]("entityId"),
> e.getAs[String]("entityType")) }.filter(x => x._2 ==
> "user").map(_._1).distinct.count
> res150: Long = 2091
>
> scala> eventDF.filter($"entityType" in
> "user").select("entityId").distinct.count
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> res155: Long = 2091
>
> scala> eventDF.filter($"entityType" !==
> "ib_user").select("entityId").distinct.count
> res152: Long = 2091
>
>
> But all of the above code works in 1.4.0.
>
> Thanks.
>
>


Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Yin Huai
Hi Ross,

What version of Spark are you using? There were two issues that affected
the results of window functions in the Spark 1.5 branch. Both issues have
been fixed and will be released with Spark 1.5.2 (this release will happen
soon). For more details on these two issues, you can take a look at
https://issues.apache.org/jira/browse/SPARK-11135 and
https://issues.apache.org/jira/browse/SPARK-11009.

Thanks,

Yin

On Mon, Nov 2, 2015 at 12:07 PM,  wrote:

> Hello Spark community -
> I am running a Spark SQL query to calculate the difference in time between
> consecutive events, using lag(event_time) over window -
>
> SELECT device_id,
>unix_time,
>event_id,
>unix_time - lag(unix_time)
>   OVER
> (PARTITION BY device_id ORDER BY unix_time,event_id)
>  AS seconds_since_last_event
> FROM ios_d_events;
>
> This is giving me some strange results in the case where the first two
> events for a particular device_id have the same timestamp.
> I used the following query to take a look at what value was being returned
> by lag():
>
> SELECT device_id,
>event_time,
>unix_time,
>event_id,
>lag(event_time) OVER (PARTITION BY device_id ORDER BY 
> unix_time,event_id) AS lag_time
> FROM ios_d_events;
>
> I’m seeing that in these cases, I am getting something like 1970-01-03 …
> instead of a null value, and the following lag times are all following the
> same format.
>
> I posted a section of this output in this SO question:
> http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls
>
> The errant results are labeled with device_id 999.
>
> Any idea why this is occurring?
>
> - Ross
>


Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread Yin Huai
@filthysocks, can you get the output of jmap -histo before the OOM (
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jmap.html)?

On Mon, Oct 26, 2015 at 6:35 AM, filthysocks  wrote:

> We upgraded from 1.4.1 to 1.5 and it's a pain;
> see
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-1-driver-memory-problems-while-doing-Cross-Validation-do-not-occur-with-1-4-1-td25076.html
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Anyone-feels-sparkSQL-in-spark1-5-1-very-slow-tp25154p25204.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Yin Huai
btw, what version of Spark did you use?

On Mon, Oct 19, 2015 at 1:08 PM, YaoPau  wrote:

> I've connected Spark SQL to the Hive Metastore and currently I'm running
> SQL
> code via pyspark.  Typically everything works fine, but sometimes after a
> long-running Spark SQL job I get the error below, and from then on I can no
> longer run Spark SQL commands.  I still do have both my sc and my sqlCtx.
>
> Any idea what this could mean?
>
> An error occurred while calling o36.sql.
> : org.apache.spark.sql.AnalysisException: Conf non-local session path
> expected to be non-null;
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:260)
> at
>
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> at
>
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at
>
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at
>
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
> at
> org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:139)
> at
> org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:139)
> at
>
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
> at
>
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
>
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at
>
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at
>
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at
>
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:235)
> at
>
> org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
> at
>
> org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92)
> at 

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Hi Pala,

Can you add the full stack trace of the exception? For now, can you use
"create temporary function" to work around the issue?
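For reference, a sketch of that workaround (Spark 1.4-era HiveContext; the jar
path, function name, and UDF class below are placeholders):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`
  hiveContext.sql("ADD JAR /path/to/my-udfs.jar")
  hiveContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.hive.MyUDF'")
  hiveContext.sql("SELECT my_udf(some_column) FROM some_table").show()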

Thanks,

Yin

On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia <
mchett...@rocketfuelinc.com.invalid> wrote:

> +user list
>
> On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> I am trying to use internal UDFs that we have added as permanent
>> functions to Hive, from within Spark SQL query (using HiveContext), but i
>> encounter NoSuchObjectException, i.e. the function could not be found.
>>
>> However, if i execute 'show functions' command in spark SQL, the
>> permanent functions appear in the list.
>>
>> I am using Spark 1.4.1 with Hive 0.13.1. I tried to debug this by looking
>> at the log and code, but it seems the show functions command and the UDF
>> query go through essentially the same code path, yet the former can see
>> the UDF while the latter can't.
>>
>> Any ideas on how to debug/fix this?
>>
>>
>> Thanks,
>> pala
>>
>
>


Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:59)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:51)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:51)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC.<init>(<console>:39)
> at $iwC.<init>(<console>:41)
> at <init>(<console>:43)
> at .<init>(<console>:47)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> On Thu, Oct 1, 2015 at 12:27 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Pala,
>>
>> Can you add the full stack trace of the exception? For now, can you use
>> "create temporary function" to work around the issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com.invalid> wrote:
>>
>>> +user list
>>>
>>> On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia <
>>> mchett...@rocketfuelinc.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to use internal UDFs that we have added as permanent
>>>> functions to Hive, from within Spark SQL query (using HiveContext), but i
>>>> encounter NoSuchObjectException, i.e. the function could not be found.
>>>>
>>>> However, if i execute 'show functions' command in spark SQL, the
>>>> permanent functions appear in the list.
>>>>
>>>> I am using Spark 1.4.1 with Hive 0.13.1. I tried to debug this by
>>>> looking at the log and code, but it seems the show functions command
>>>> and the UDF query go through essentially the same code path, yet the
>>>> former can see the UDF while the latter can't.
>>>>
>>>> Any ideas on how to debug/fix this?
>>>>
>>>>
>>>> Thanks,
>>>> pala
>>>>
>>>
>>>
>>
>


Re: Generic DataType in UDAF

2015-09-25 Thread Yin Huai
Hi Ritesh,

Right now, we only allow the specific data types defined in the inputSchema.
Supporting abstract types (e.g. NumericType) may make the logic of a UDAF
more complex. It would be great to understand the use cases first. What kinds
of input data types do you want to support, and do you need to know the
actual argument types to determine how to process the input data?

btw, for now, one possible workaround is to define multiple UDAFs for
different input types. Then, based on arguments that you have, you invoke
the corresponding UDAF.
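A rough sketch of that workaround with the Spark 1.5 UDAF API (a trivial
long-sum is used only to show the shape; class, column, and function names
are made up):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
  import org.apache.spark.sql.types._

  // One concrete UDAF per input type you care about.
  class LongSum extends UserDefinedAggregateFunction {
    def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
    def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
    def dataType: DataType = LongType
    def deterministic: Boolean = true
    def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
    def update(buffer: MutableAggregationBuffer, input: Row): Unit =
      if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
    def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
      buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    def evaluate(buffer: Row): Any = buffer.getLong(0)
  }

  // Dispatch on the actual column type and hand back the matching UDAF.
  def sumFor(dt: DataType): UserDefinedAggregateFunction = dt match {
    case LongType => new LongSum
    // case DoubleType => new DoubleSum   // analogous class using DoubleType/getDouble
    case other    => sys.error(s"no UDAF defined for input type $other")
  }

  // Usage: df.agg(sumFor(df.schema("value").dataType)(df("value")))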

Thanks,

Yin

On Fri, Sep 25, 2015 at 8:07 AM, Ritesh Agrawal <
ragra...@netflix.com.invalid> wrote:

> Hi all,
>
> I am trying to learn about UDAF and implemented a simple reservoir sample
> UDAF. It's working fine. However I am not able to figure out what DataType
> I should use so that it can deal with all DataTypes (simple and complex).
> For instance, currently I have defined my input schema as
>
>  def inputSchema = StructType(StructField("id", StringType) :: Nil )
>
>
> Instead of StringType, can I use some other data type that is a superclass
> of all the DataTypes?
>
> Ritesh
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue.

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:

> btw, does 1.4 has the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset. I was surprised by
>>> the execution plan of take(1), head() or first().
>>>
>>> For your reference, this is what I did in pyspark 1.5:
>>> df=sqlContext.read.parquet("someparquetfiles")
>>> df.head()
>>>
>>> The above lines take over 15 minutes. I was frustrated because I can do
>>> better without using spark :) Since I like spark, so I tried to figure out
>>> why. It seems the dataframe requires 3 stages to give me the first row. It
>>> reads all data (which is about 1 billion rows) and run Limit twice.
>>>
>>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>>
>>> df.rdd.take(1) //runs much faster.
>>>
>>> Is this expected? Why head/first/take is so slow for dataframe? Is it a
>>> bug in the optimizer? or I did something wrong?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is that df.rdd does not work very well with limit. In
Scala, df.limit(1).rdd will also trigger the issue you observed. I will add
this to the JIRA.

On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam <chiling...@gmail.com> wrote:

> I just noticed you found 1.4 has the same issue. I added that as well in
> the ticket.
>
> On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Yin,
>>
>> You are right! I just tried the scala version with the above lines, it
>> works as expected.
>> I'm not sure if it happens also in 1.4 for pyspark but I thought the
>> pyspark code just calls the scala code via py4j. I didn't expect that this
>> bug is pyspark specific. That surprises me actually a bit. I created a
>> ticket for this (SPARK-10731
>> <https://issues.apache.org/jira/browse/SPARK-10731>).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> btw, does 1.4 has the same problem?
>>>
>>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>>>
>>>> Hi Jerry,
>>>>
>>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>>
>>>> Thanks,
>>>>
>>>> Yin
>>>>
>>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Spark Developers,
>>>>>
>>>>> I just ran some very simple operations on a dataset. I was surprised by
>>>>> the execution plan of take(1), head() or first().
>>>>>
>>>>> For your reference, this is what I did in pyspark 1.5:
>>>>> df=sqlContext.read.parquet("someparquetfiles")
>>>>> df.head()
>>>>>
>>>>> The above lines take over 15 minutes. I was frustrated because I can
>>>>> do better without using spark :) Since I like spark, so I tried to figure
>>>>> out why. It seems the dataframe requires 3 stages to give me the first 
>>>>> row.
>>>>> It reads all data (which is about 1 billion rows) and run Limit twice.
>>>>>
>>>>> Instead of head(), show(1) runs much faster. Not to mention that if I
>>>>> do:
>>>>>
>>>>> df.rdd.take(1) //runs much faster.
>>>>>
>>>>> Is this expected? Why head/first/take is so slow for dataframe? Is it
>>>>> a bug in the optimizer? or I did something wrong?
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry,

Looks like it is a Python-specific issue. Can you create a JIRA?

Thanks,

Yin

On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:

> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprised by the
> execution plan of take(1), head() or first().
>
> For your reference, this is what I did in pyspark 1.5:
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
>
> The above lines take over 15 minutes. I was frustrated because I can do
> better without using spark :) Since I like spark, so I tried to figure out
> why. It seems the dataframe requires 3 stages to give me the first row. It
> reads all data (which is about 1 billion rows) and run Limit twice.
>
> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>
> df.rdd.take(1) //runs much faster.
>
> Is this expected? Why head/first/take is so slow for dataframe? Is it a
> bug in the optimizer? or I did something wrong?
>
> Best Regards,
>
> Jerry
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 has the same problem?

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:

> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark Developers,
>>
>> I just ran some very simple operations on a dataset. I was surprised by
>> the execution plan of take(1), head() or first().
>>
>> For your reference, this is what I did in pyspark 1.5:
>> df=sqlContext.read.parquet("someparquetfiles")
>> df.head()
>>
>> The above lines take over 15 minutes. I was frustrated because I can do
>> better without using spark :) Since I like spark, so I tried to figure out
>> why. It seems the dataframe requires 3 stages to give me the first row. It
>> reads all data (which is about 1 billion rows) and run Limit twice.
>>
>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>
>> df.rdd.take(1) //runs much faster.
>>
>> Is this expected? Why head/first/take is so slow for dataframe? Is it a
>> bug in the optimizer? or I did something wrong?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Re: Null Value in DecimalType column of DataFrame

2015-09-17 Thread Yin Huai
As I mentioned before, the range of values of DecimalType(10, 10) is [0,
1). If you have a value 10.5 and you want to cast it to DecimalType(10,
10), I do not think there is any better return value than null. It looks like
DecimalType(10, 10) is not the right type for your use case. You
need a decimal type that has precision - scale >= 2.
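To make that concrete, a small sketch (Spark 1.5-era API; assumes an existing
SQLContext named sqlContext, e.g. in spark-shell):

  import org.apache.spark.sql.types.DecimalType
  import sqlContext.implicits._

  val df = Seq(Tuple1(10.5), Tuple1(0.1)).toDF("d")

  // DecimalType(precision, scale) keeps `scale` digits after the decimal point
  // and precision - scale digits before it, so DecimalType(10, 10) covers [0, 1).
  df.select(df("d").cast(DecimalType(10, 10))).show()  // 10.5 -> null, 0.1 is fine
  df.select(df("d").cast(DecimalType(12, 10))).show()  // both values fit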

On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

>
> Hi Yin, posted here because I think it's a bug.
> So, it will return null and I can get a nullpointerexception, as I was
> getting. Is this really the expected behavior? Never seen something
> returning null in other Scala tools that I used.
>
> Regards,
>
>
> 2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>:
>
>> btw, move it to user list.
>>
>> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> A scale of 10 means that there are 10 digits at the right of the decimal
>>> point. If you also have precision 10, the range of your data will be [0, 1)
>>> and casting "10.5" to DecimalType(10, 10) will return null, which is
>>> expected.
>>>
>>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>>> dirceu.semigh...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>> I'm moving from spark 1.4 to 1.5, and one of my tests is failing.
>>>> It seems that there was some changes in org.apache.spark.sql.types.
>>>> DecimalType
>>>>
>>>> This ugly code is a little sample to reproduce the error, don't use it
>>>> into your project.
>>>>
>>>> test("spark test") {
>>>>   val file = 
>>>> context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => 
>>>> Row.fromSeq({
>>>> val values = f.split(",")
>>>> 
>>>> Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>>> values.tail.tail.tail.head)}))
>>>>
>>>>   val structType = StructType(Seq(StructField("id", IntegerType, false),
>>>> StructField("int2", IntegerType, false), StructField("double",
>>>>
>>>>  DecimalType(10,10), false),
>>>>
>>>>
>>>> StructField("str2", StringType, false)))
>>>>
>>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>>   df.first
>>>> }
>>>>
>>>> The content of the file is:
>>>>
>>>> 1,5,10.5,va
>>>> 2,1,0.1,vb
>>>> 3,8,10.0,vc
>>>>
>>>> The problem resides in DecimalType: before 1.5 the scale wasn't
>>>> required. Now, when using DecimalType(12,10) it works fine, but using
>>>> DecimalType(10,10) the decimal value
>>>> 10.5 becomes null, while 0.1 works.
>>>>
>>>> Is there anybody working with DecimalType for 1.5.1?
>>>>
>>>> Regards,
>>>> Dirceu
>>>>
>>>>
>>>
>>
>
>


Re: Null Value in DecimalType column of DataFrame

2015-09-14 Thread Yin Huai
btw, move it to user list.

On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:

> A scale of 10 means that there are 10 digits at the right of the decimal
> point. If you also have precision 10, the range of your data will be [0, 1)
> and casting "10.5" to DecimalType(10, 10) will return null, which is
> expected.
>
> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi all,
>> I'm moving from spark 1.4 to 1.5, and one of my tests is failing.
>> It seems that there was some changes in org.apache.spark.sql.types.
>> DecimalType
>>
>> This ugly code is a little sample to reproduce the error, don't use it
>> into your project.
>>
>> test("spark test") {
>>   val file = 
>> context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => 
>> Row.fromSeq({
>> val values = f.split(",")
>> 
>> Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>> values.tail.tail.tail.head)}))
>>
>>   val structType = StructType(Seq(StructField("id", IntegerType, false),
>> StructField("int2", IntegerType, false), StructField("double",
>>
>>  DecimalType(10,10), false),
>>
>>
>> StructField("str2", StringType, false)))
>>
>>   val df = context.sqlContext.createDataFrame(file,structType)
>>   df.first
>> }
>>
>> The content of the file is:
>>
>> 1,5,10.5,va
>> 2,1,0.1,vb
>> 3,8,10.0,vc
>>
>> The problem resides in DecimalType: before 1.5 the scale wasn't required.
>> Now, when using DecimalType(12,10) it works fine, but using
>> DecimalType(10,10) the decimal value
>> 10.5 becomes null, while 0.1 works.
>>
>> Is there anybody working with DecimalType for 1.5.1?
>>
>> Regards,
>> Dirceu
>>
>>
>


Re: java.util.NoSuchElementException: key not found

2015-09-11 Thread Yin Huai
Looks like you hit https://issues.apache.org/jira/browse/SPARK-10422. It has
been fixed in branch-1.5, and the 1.5.1 release will have it.

On Fri, Sep 11, 2015 at 3:35 AM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:

> Hi all,
> After upgrading Spark to 1.5, Streaming occasionally throws
> java.util.NoSuchElementException: key not found. Could the data be causing
> this error? Please help if anyone has run into a similar problem before.
> Thanks very much.
>
> The exception occurs when writing into the database.
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 
> (TID 76, slave2): java.util.NoSuchElementException: key not found: 
> ruixue.sys.session.request
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>
> at 
> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
>
> at 
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
>
> at 
> org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>
> at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>
> at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> --
> guoqing0...@yahoo.com.hk
>


Re: How to evaluate custom UDF over window

2015-08-24 Thread Yin Huai
For now, user-defined window functions are not supported. We will add support
in the future.

On Mon, Aug 24, 2015 at 6:26 AM, xander92 alexander.fra...@ompnt.com
wrote:

 The ultimate aim of my program is to be able to wrap an arbitrary Scala
 function (mostly will be statistics / customized rolling window metrics) in
 a UDF and evaluate them on DataFrames using the window functionality.

 So my main question is how do I express that a UDF takes a Frame of rows
 from a DataFrame as an argument instead of just a single row? And what sort
 of arguments would the arbitrary Scala function need to take in order to
 handle the raw data from the Frame?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-evaluate-custom-UDF-over-window-tp24419.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SQLContext Create Table Problem

2015-08-19 Thread Yin Huai
Can you try to use HiveContext instead of SQLContext? Your query is trying
to create a table and persist the table's metadata in the metastore, which
is only supported by HiveContext.
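A minimal sketch of the HiveContext variant (assumes a Spark build with
-Phive; the S3 location below is a placeholder, not the original one):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("create-table"))
  val sqlContext = new HiveContext(sc)  // instead of new SQLContext(sc)

  sqlContext.sql("""
    create table if not exists landing (
      date string,
      referrer string
    )
    partitioned by (partnerid string, dt string)
    row format delimited fields terminated by '\t' lines terminated by '\n'
    stored as textfile location 's3n://bucket/path'
  """)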

On Wed, Aug 19, 2015 at 8:44 AM, Yusuf Can Gürkan yu...@useinsider.com
wrote:

 Hello,

 I’m trying to create a table with sqlContext.sql method as below:

 val sc = new SparkContext()
 val sqlContext = new SQLContext(sc)

 import sqlContext.implicits._

 sqlContext.sql(s"""
 create table if not exists landing (
 date string,
 referrer string
 )
 partitioned by (partnerid string,dt string)
 row format delimited fields terminated by '\t' lines terminated by '\n'
 STORED AS TEXTFILE LOCATION 's3n://...'
 """)


 It gives an error on spark-submit:

 Exception in thread "main" java.lang.RuntimeException: [2.1] failure:
 ``with'' expected but identifier create found

 create external table if not exists landing (

 ^
 at scala.sys.package$.error(package.scala:27)
 at
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
 at
 org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)



 What can be the reason??



Re: About Databricks's spark-sql-perf

2015-08-13 Thread Yin Huai
Hi Todd,

We have not had a chance to update it. We will update it after the 1.5 release.

Thanks,

Yin

On Thu, Aug 13, 2015 at 6:49 AM, Todd bit1...@163.com wrote:

 Hi,
 I got a question about the spark-sql-perf project by Databricks at 
 https://github.com/databricks/spark-sql-perf/


 The Tables.scala (
 https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/Tables.scala)
 and BigData (
 https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/BigData.scala)
 are empty files.
 Is this intentional, or is it a bug?
 Also, the following code snippet from the README.md won't compile, as there
 is no Tables class defined in the org.apache.spark.sql.parquet package:
 (I am using Spark 1.4.1; is the code compatible with Spark 1.4.1?)

 import org.apache.spark.sql.parquet.Tables
 // Tables in TPC-DS benchmark used by experiments.
 val tables = Tables(sqlContext)
 // Setup TPC-DS experiment
 val tpcds = new TPCDS (sqlContext = sqlContext)





Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
We do this in SparkILoop (
https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037).
What is the version of Spark you are using? How did you add the spark-csv
jar?

On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers ko...@tresata.com wrote:

 has anyone tried to make HiveContext only if the class is available?

 i tried this:
  implicit lazy val sqlc: SQLContext = try {
    Class.forName("org.apache.spark.sql.hive.HiveContext", true,
      Thread.currentThread.getContextClassLoader)
      .getConstructor(classOf[SparkContext]).newInstance(sc).asInstanceOf[SQLContext]
  } catch { case e: ClassNotFoundException => new SQLContext(sc) }

 it compiles fine, but i get classloader issues when i actually use it on a
 cluster. for example:

 Exception in thread main java.lang.RuntimeException: Failed to load
 class for data source: com.databricks.spark.csv
 at scala.sys.package$.error(package.scala:27)
 at
 org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
 at
 org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)




Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
No problem:) Glad to hear that!

On Thu, Jul 16, 2015 at 8:22 PM, Koert Kuipers ko...@tresata.com wrote:

 that solved it, thanks!

 On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers ko...@tresata.com wrote:

 thanks i will try 1.4.1

 On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai yh...@databricks.com wrote:

 Hi Koert,

 For the classloader issue, you probably hit
 https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed
 in Spark 1.4.1. Can you try 1.4.1 and see if the exception disappear?

 Thanks,

 Yin

 On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers ko...@tresata.com
 wrote:

 i am using scala 2.11

 spark jars are not in my assembly jar (they are provided), since i
 launch with spark-submit

 On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers ko...@tresata.com
 wrote:

 spark 1.4.0

 spark-csv is a normal dependency of my project and in the assembly jar
 that i use

 but i also tried adding spark-csv with --package for spark-submit, and
 got the same error

 On Thu, Jul 16, 2015 at 4:31 PM, Yin Huai yh...@databricks.com
 wrote:

 We do this in SparkILookp (
 https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037).
 What is the version of Spark you are using? How did you add the spark-csv
 jar?

 On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers ko...@tresata.com
 wrote:

 has anyone tried to make HiveContext only if the class is available?

 i tried this:
  implicit lazy val sqlc: SQLContext = try {
    Class.forName("org.apache.spark.sql.hive.HiveContext", true,
      Thread.currentThread.getContextClassLoader)
      .getConstructor(classOf[SparkContext]).newInstance(sc).asInstanceOf[SQLContext]
  } catch { case e: ClassNotFoundException => new SQLContext(sc) }

 it compiles fine, but i get classloader issues when i actually use
 it on a cluster. for example:

 Exception in thread main java.lang.RuntimeException: Failed to
 load class for data source: com.databricks.spark.csv
 at scala.sys.package$.error(package.scala:27)
 at
 org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
 at
 org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)










Re: [SPARK-SQL] Window Functions optimization

2015-07-13 Thread Yin Huai
Your query will be partitioned once. Then, a single Window operator will
evaluate these three functions. As mentioned by Harish, you can take a look
at the plan (sql("your sql...").explain()).
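For example (a sketch; the query is the one from the original message, so the
table and column names are placeholders, and window functions in 1.4/1.5
require a HiveContext):

  val df = sqlContext.sql("""
    select key,
           max(value1) over (partition by key) as m1,
           max(value2) over (partition by key) as m2,
           max(value3) over (partition by key) as m3
    from table
  """)

  df.explain(true)                          // extended output: logical and physical plans
  println(df.queryExecution.optimizedPlan)  // just the optimized logical plan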

On Mon, Jul 13, 2015 at 12:26 PM, Harish Butani rhbutani.sp...@gmail.com
wrote:

 Just once.
 You can see this by printing the optimized logical plan.
 You will see just one repartition operation.

 So do:
 val df = sql("your sql...")
 println(df.queryExecution.analyzed)

 On Mon, Jul 13, 2015 at 6:37 AM, Hao Ren inv...@gmail.com wrote:

 Hi,

 I would like to know: has any optimization been done for window
 functions in Spark SQL?

 For example.

 select key,
 max(value1) over(partition by key) as m1,
 max(value2) over(partition by key) as m2,
 max(value3) over(partition by key) as m3
 from table

 The query above creates 3 fields based on the same partition rule.

 The question is:
 Will spark-sql partition the table 3 times in the same way to get the
 three
 max values ? or just partition once if it finds the partition rule is the
 same ?

 It would be nice if someone could point out some lines of code on it.

 Thank you.
 Hao



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Window-Functions-optimization-tp23796.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Yin Huai
Hi Brandon,

Can you explain what you meant by "It simply does not work"? Did you not see
new data files?

Thanks,

Yin

On Fri, Jul 10, 2015 at 11:55 AM, Brandon White bwwintheho...@gmail.com
wrote:

 Why does this not work? Is insert into broken in 1.3.1? It does not throw
 any errors, fail, or throw exceptions. It simply does not work.


 val ssc = new StreamingContext(sc, Minutes(10))

 val currentStream = ssc.textFileStream(ss3://textFileDirectory/)
 val dayBefore = sqlContext.jsonFile(ss3://textFileDirectory/)

 dayBefore.saveAsParquetFile(/tmp/cache/dayBefore.parquet)
 val parquetFile = sqlContext.parquetFile(/tmp/cache/dayBefore.parquet)
 parquetFile.registerTempTable(rideaccepted)

 currentStream.foreachRDD { rdd =
   val df = sqlContext.jsonRDD(rdd)
   df.insertInto(rideaccepted)
 }

 ssc.start()


 Or this?

 val ssc = new StreamingContext(sc, Minutes(10))
 val currentStream = ssc.textFileStream(s3://textFileDirectory)
 val day = sqlContext.jsonFile(s3://textFileDirectory)
 day.registerTempTable(rideaccepted)


 currentStream.foreachRDD { rdd =
   val df = sqlContext.jsonRDD(rdd)
   df.registerTempTable(tmp_rideaccepted)
   sqlContext.sql(insert into table rideaccepted select * from 
 tmp_rideaccepted)
 }

 ssc.start()


 or this?

 val ssc = new StreamingContext(sc, Minutes(10))

 val currentStream = ssc.textFileStream(ss3://textFileDirectory/)
 val dayBefore = sqlContext.jsonFile(ss3://textFileDirectory/)

 dayBefore..registerTempTable(rideaccepted)

 currentStream.foreachRDD { rdd =
   val df = sqlContext.jsonRDD(rdd)
   df.insertInto(rideaccepted)
 }

 ssc.start()




Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Yin Huai
Jerrick,

Let me ask a few clarification questions. What is the version of Spark? Is
the table a hive table? What is the format of the table? Is the table
partitioned?

Thanks,

Yin

On Sun, Jul 12, 2015 at 6:01 PM, ayan guha guha.a...@gmail.com wrote:

 Describe computes statistics, so it will try to query the table. The one
 you are looking for is df.printSchema()

 On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,

>> I'm new to Spark and this question may be trivial or may already have been
>> answered, but when I do a 'describe table' from the SparkSQL CLI it seems
>> to look at all records in the table (which takes a really long time for a
>> big table) instead of just giving me the metadata of the table. I would
>> appreciate it if someone could give me some pointers, thanks!




 --
 Best Regards,
 Ayan Guha



Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
You meant SPARK_REPL_OPTS? I did a quick search. Looks like it has been
removed since 1.0. I think it did not affect the behavior of the shell.

On Mon, Jul 6, 2015 at 9:04 AM, Simeon Simeonov s...@swoop.com wrote:

   Yin, that did the trick.

  I'm curious what was the effect of the environment variable, however, as
 the behavior of the shell changed from hanging to quitting when the env var
 value got to 1g.

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Monday, July 6, 2015 at 11:41 AM
 To: Denny Lee denny.g@gmail.com
 Cc: Simeon Simeonov s...@swoop.com, Andy Huang andy.hu...@servian.com.au,
 user user@spark.apache.org

 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   Hi Sim,

  I think the right way to set the PermGen Size is through driver extra
 JVM options, i.e.

  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m

  Can you try it? Without this conf, your driver's PermGen size is still
 128m.

  Thanks,

  Yin

 On Mon, Jul 6, 2015 at 4:07 AM, Denny Lee denny.g@gmail.com wrote:

  I went ahead and tested your file and the results from the tests can be
 seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.

  Basically, when running {Java 7, MaxPermSize = 256} or {Java 8,
 default} the query ran without any issues.  I was able to recreate the
 issue with {Java 7, default}.  I included the commands I used to start the
 spark-shell but basically I just used all defaults (no alteration to driver
 or executor memory) with the only additional call was with
 driver-class-path to connect to MySQL Hive metastore.  This is on OSX
 Macbook Pro.

  One thing I did notice is that your version of Java 7 is version 51
 while my version of Java 7 is version 79. Could you see if updating to Java 7
 version 79 perhaps allows you to use the MaxPermSize call?




  On Mon, Jul 6, 2015 at 1:36 PM Simeon Simeonov s...@swoop.com wrote:

  The file is at
 https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1

  The command was included in the gist

  SPARK_REPL_OPTS=-XX:MaxPermSize=256m
 spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages
 com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Monday, July 6, 2015 at 12:59 AM
 To: Simeon Simeonov s...@swoop.com
 Cc: Denny Lee denny.g@gmail.com, Andy Huang 
 andy.hu...@servian.com.au, user user@spark.apache.org

 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   I have never seen issue like this. Setting PermGen size to 256m
 should solve the problem. Can you send me your test file and the command
 used to launch the spark shell or your application?

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov s...@swoop.com wrote:

   Yin,

  With 512Mb PermGen, the process still hung and had to be kill -9ed.

   At 1Gb the spark shell & associated processes stopped hanging and
  started exiting with

   scala> println(dfCount.first.getLong(0))
 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040)
 called with curMem=0, maxMem=2223023063
 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
 values in memory (estimated size 229.5 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184)
 called with curMem=235040, maxMem=2223023063
 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
 stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added
 broadcast_2_piece0 in memory on localhost:65464 (size: 19.7 KB, free: 2.1
 GB)
 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from
 first at console:30
 java.lang.OutOfMemoryError: PermGen space
 Stopping spark context.
 Exception in thread main
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread main
 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
 broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
 GB)

   That did not change up until 4Gb of PermGen space and 8Gb for driver &
   executor each.

  I stopped at this point because the exercise started looking silly.
 It is clear that 1.4.0 is using memory in a substantially different manner.

  I'd be happy to share the test file so you can reproduce this in your
 own environment.

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com |
 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Sunday, July 5, 2015 at 11:04 PM
 To: Denny Lee denny.g@gmail.com
 Cc: Andy Huang andy.hu...@servian.com.au, Simeon Simeonov 
 s...@swoop.com

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
Hi Sim,

I think the right way to set the PermGen Size is through driver extra JVM
options, i.e.

--conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m

Can you try it? Without this conf, your driver's PermGen size is still 128m.
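For example, the spark-shell command from earlier in this thread would become
something like this (illustrative only; the package and memory settings are
copied from the earlier command):

spark-1.4.0-bin-hadoop2.6/bin/spark-shell \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" \
  --packages com.databricks:spark-csv_2.10:1.0.3 \
  --driver-memory 4g --executor-memory 4g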

Thanks,

Yin

On Mon, Jul 6, 2015 at 4:07 AM, Denny Lee denny.g@gmail.com wrote:

 I went ahead and tested your file and the results from the tests can be
 seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.

 Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default}
 the query ran without any issues.  I was able to recreate the issue with
 {Java 7, default}.  I included the commands I used to start the spark-shell
 but basically I just used all defaults (no alteration to driver or executor
 memory) with the only additional call was with driver-class-path to connect
 to MySQL Hive metastore.  This is on OSX Macbook Pro.

 One thing I did notice is that your version of Java 7 is version 51 while
  my version of Java 7 is version 79. Could you see if updating to Java 7
 version 79 perhaps allows you to use the MaxPermSize call?




 On Mon, Jul 6, 2015 at 1:36 PM Simeon Simeonov s...@swoop.com wrote:

  The file is at
 https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1

  The command was included in the gist

  SPARK_REPL_OPTS=-XX:MaxPermSize=256m
 spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages
 com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Monday, July 6, 2015 at 12:59 AM
 To: Simeon Simeonov s...@swoop.com
 Cc: Denny Lee denny.g@gmail.com, Andy Huang 
 andy.hu...@servian.com.au, user user@spark.apache.org

 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   I have never seen issue like this. Setting PermGen size to 256m should
 solve the problem. Can you send me your test file and the command used to
 launch the spark shell or your application?

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov s...@swoop.com wrote:

   Yin,

  With 512Mb PermGen, the process still hung and had to be kill -9ed.

   At 1Gb the spark shell & associated processes stopped hanging and
  started exiting with

   scala> println(dfCount.first.getLong(0))
 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040)
 called with curMem=0, maxMem=2223023063
 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
 values in memory (estimated size 229.5 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184)
 called with curMem=235040, maxMem=2223023063
 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
 stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added
 broadcast_2_piece0 in memory on localhost:65464 (size: 19.7 KB, free: 2.1
 GB)
 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from
 first at console:30
 java.lang.OutOfMemoryError: PermGen space
 Stopping spark context.
 Exception in thread main
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread main
 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
 broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
 GB)

   That did not change up until 4Gb of PermGen space and 8Gb for driver &
  executor each.

  I stopped at this point because the exercise started looking silly. It
 is clear that 1.4.0 is using memory in a substantially different manner.

  I'd be happy to share the test file so you can reproduce this in your
 own environment.

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Sunday, July 5, 2015 at 11:04 PM
 To: Denny Lee denny.g@gmail.com
 Cc: Andy Huang andy.hu...@servian.com.au, Simeon Simeonov 
 s...@swoop.com, user user@spark.apache.org
 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   Sim,

   Can you increase the PermGen size? Please let me know what your setting
  is when the problem disappears.

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote:

  I had run into the same problem where everything was working
 swimmingly with Spark 1.3.1.  When I switched to Spark 1.4, either by
 upgrading to Java8 (from Java7) or by knocking up the PermGenSize had
 solved my issue.  HTH!



  On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au
 wrote:

 We have hit the same issue in spark shell when registering a temp
 table. We observed it happening with those who had JDK 6. The problem went
 away after installing jdk 8. This was only for the tutorial materials 
 which
 was about loading a parquet

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
I have never seen issue like this. Setting PermGen size to 256m should
solve the problem. Can you send me your test file and the command used to
launch the spark shell or your application?

Thanks,

Yin

On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov s...@swoop.com wrote:

   Yin,

  With 512Mb PermGen, the process still hung and had to be kill -9ed.

   At 1Gb the spark shell & associated processes stopped hanging and
  started exiting with

   scala> println(dfCount.first.getLong(0))
 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040) called
 with curMem=0, maxMem=2223023063
 15/07/06 00:10:07 INFO storage.MemoryStore: Block broadcast_2 stored as
 values in memory (estimated size 229.5 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.MemoryStore: ensureFreeSpace(20184) called
 with curMem=235040, maxMem=2223023063
 15/07/06 00:10:08 INFO storage.MemoryStore: Block broadcast_2_piece0
 stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
 15/07/06 00:10:08 INFO storage.BlockManagerInfo: Added broadcast_2_piece0
 in memory on localhost:65464 (size: 19.7 KB, free: 2.1 GB)
 15/07/06 00:10:08 INFO spark.SparkContext: Created broadcast 2 from first
 at console:30
 java.lang.OutOfMemoryError: PermGen space
 Stopping spark context.
 Exception in thread main
 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread main
 15/07/06 00:10:14 INFO storage.BlockManagerInfo: Removed
 broadcast_2_piece0 on localhost:65464 in memory (size: 19.7 KB, free: 2.1
 GB)

   That did not change up until 4Gb of PermGen space and 8Gb for driver &
  executor each.

  I stopped at this point because the exercise started looking silly. It
 is clear that 1.4.0 is using memory in a substantially different manner.

  I'd be happy to share the test file so you can reproduce this in your
 own environment.

  /Sim

  Simeon Simeonov, Founder  CTO, Swoop http://swoop.com/
 @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746


   From: Yin Huai yh...@databricks.com
 Date: Sunday, July 5, 2015 at 11:04 PM
 To: Denny Lee denny.g@gmail.com
 Cc: Andy Huang andy.hu...@servian.com.au, Simeon Simeonov s...@swoop.com,
 user user@spark.apache.org
 Subject: Re: 1.4.0 regression: out-of-memory errors on small data

   Sim,

   Can you increase the PermGen size? Please let me know what your setting
  is when the problem disappears.

  Thanks,

  Yin

 On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote:

  I had run into the same problem where everything was working swimmingly
 with Spark 1.3.1.  When I switched to Spark 1.4, either by upgrading to
 Java8 (from Java7) or by knocking up the PermGenSize had solved my issue.
 HTH!



  On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au
 wrote:

 We have hit the same issue in spark shell when registering a temp table.
 We observed it happening with those who had JDK 6. The problem went away
 after installing jdk 8. This was only for the tutorial materials which was
 about loading a parquet file.

  Regards
 Andy

 On Sat, Jul 4, 2015 at 2:54 AM, sim s...@swoop.com wrote:

 @bipin, in my case the error happens immediately in a fresh shell in
 1.4.0.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




  --
  Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
 f: 02 9376 0730| m: 0433221979





Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
Sim,

Can you increase the PermGen size? Please let me know what your setting is
when the problem disappears.

Thanks,

Yin

On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote:

 I had run into the same problem where everything was working swimmingly
 with Spark 1.3.1.  When I switched to Spark 1.4, either by upgrading to
 Java8 (from Java7) or by knocking up the PermGenSize had solved my issue.
 HTH!



 On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au
 wrote:

 We have hit the same issue in spark shell when registering a temp table.
 We observed it happening with those who had JDK 6. The problem went away
 after installing jdk 8. This was only for the tutorial materials which was
 about loading a parquet file.

 Regards
 Andy

 On Sat, Jul 4, 2015 at 2:54 AM, sim s...@swoop.com wrote:

 @bipin, in my case the error happens immediately in a fresh shell in
 1.4.0.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
 f: 02 9376 0730| m: 0433221979




Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim,

Seems you already set the PermGen size to 256m, right? I notice that in
your shell, you created a HiveContext (it further increased the memory
consumption on PermGen). But spark shell has already created a HiveContext
for you (sqlContext; you can use asInstanceOf to access HiveContext's
methods). Can you just use the sqlContext created by the shell and try
again?
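For example, something like this (an illustrative snippet, not tested here):

// In a spark-shell built with Hive support, sqlContext is already a
// HiveContext, so no second HiveContext (and no extra PermGen) is needed.
val hiveCtx = sqlContext.asInstanceOf[org.apache.spark.sql.hive.HiveContext]
hiveCtx.sql("show tables").show()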

Thanks,

Yin

On Thu, Jul 2, 2015 at 12:50 PM, Yin Huai yh...@databricks.com wrote:

 Hi Sim,

 Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3
 (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you
 add --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m in the
 command you used to launch Spark shell? This will increase the PermGen size
 from 128m (our default) to 256m.

 Thanks,

 Yin

 On Thu, Jul 2, 2015 at 12:40 PM, sim s...@swoop.com wrote:

 A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1
 and
 fails with a series of out-of-memory errors in 1.4.0.

 This gist https://gist.github.com/ssimeonov/a49b75dc086c3ac6f3c4
 includes the code and the full output from the 1.3.1 and 1.4.0 runs,
 including the command line showing how spark-shell is started.

 Should the 1.4.0 spark-shell be started with different options to avoid
 this
 problem?

 Thanks,
 Sim




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim,

Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3
(explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you
add --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m in the
command you used to launch Spark shell? This will increase the PermGen size
from 128m (our default) to 256m.

Thanks,

Yin

On Thu, Jul 2, 2015 at 12:40 PM, sim s...@swoop.com wrote:

 A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1
 and
 fails with a series of out-of-memory errors in 1.4.0.

 This gist https://gist.github.com/ssimeonov/a49b75dc086c3ac6f3c4
 includes the code and the full output from the 1.3.1 and 1.4.0 runs,
 including the command line showing how spark-shell is started.

 Should the 1.4.0 spark-shell be started with different options to avoid
 this
 problem?

 Thanks,
 Sim




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Yin Huai
The function accepted by explode is f: Row => TraversableOnce[A]. Seems
user_mentions is an array of structs. So, can you change your
pattern matching to the following?

case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map(elem => ...)
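Something along these lines might work (an untested sketch; it reuses the
UserMention case class and the isEmpty UDF from your code below, and the field
names come from your printSchema output):

val mentions = tweets.select("entities.user_mentions").
  filter(!isEmpty($"user_mentions")).
  explode($"user_mentions") {
    // The generator receives the outer Row; its single field is the array of structs.
    case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("id_str"),
        elem.getAs[Seq[Long]]("indices").toArray,
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }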

On Wed, Jun 24, 2015 at 5:27 AM, Gustavo Arjones garjo...@socialmetrix.com
wrote:

 Hi All,

 I am using the new *Apache Spark version 1.4.0 Data-frames API* to
 extract information from Twitter's Status JSON, mostly focused on the Entities
 Object https://dev.twitter.com/overview/api/entities - the relevant
  part to this question is shown below:

  {
    ...
    ...
    "entities": {
      "hashtags": [],
      "trends": [],
      "urls": [],
      "user_mentions": [
        {
          "screen_name": "linobocchini",
          "name": "Lino Bocchini",
          "id": 187356243,
          "id_str": "187356243",
          "indices": [ 3, 16 ]
        },
        {
          "screen_name": "jeanwyllys_real",
          "name": "Jean Wyllys",
          "id": 23176,
          "id_str": "23176",
          "indices": [ 79, 95 ]
        }
      ],
      "symbols": []
    },
    ...
    ...
  }

 There are several examples on how extract information from primitives
 types as string, integer, etc - but I couldn't find anything on how to
 process those kind of *complex* structures.

  I tried the code below but it still doesn't work; it throws an exception:

  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val tweets = sqlContext.read.json("tweets.json")

  // this function is just to filter empty entities.user_mentions[] nodes
  // some tweets doesn't contains any mentions
  import org.apache.spark.sql.functions.udf
  val isEmpty = udf((value: List[Any]) => value.isEmpty)

  import org.apache.spark.sql._
  import sqlContext.implicits._
  case class UserMention(id: Long, idStr: String, indices: Array[Long], name:
  String, screenName: String)

  val mentions = tweets.select("entities.user_mentions").
    filter(!isEmpty($"user_mentions")).
    explode($"user_mentions") {
    case Row(arr: Array[Row]) => arr.map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("is_str"),
        elem.getAs[Array[Long]]("indices"),
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }

 mentions.first

 Exception when I try to call mentions.first:

  scala> mentions.first
 15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
 scala.MatchError: [List([187356243,187356243,List(3, 16),Lino 
 Bocchini,linobocchini], [23176,23176,List(79, 95),Jean 
 Wyllys,jeanwyllys_real])] (of class 
 org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
 at 
 $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:34)
 at 
 $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:34)
 at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
 at 
 org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)

  What is wrong here? I understand it is related to the types but I couldn't
  figure it out yet.

 As additional context, the structure mapped automatically is:

  scala> mentions.printSchema
 root
  |-- user_mentions: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- id: long (nullable = true)
  |||-- id_str: string (nullable = true)
  |||-- indices: array (nullable = true)
  ||||-- element: long (containsNull = true)
  |||-- name: string (nullable = true)
  |||-- screen_name: string (nullable = true)

  *NOTE 1:* I know it is possible to solve this using HiveQL but I would
  like to use Data-frames since there is so much momentum around it.

 SELECT explode(entities.user_mentions) as mentions
 FROM tweets

  *NOTE 2:* the *UDF* val isEmpty = udf((value: List[Any]) => value.isEmpty) is
  an ugly hack and I'm missing something here, but it was the only way I came up
  with to avoid an NPE

 I’ve posted the same question on SO:
 http://stackoverflow.com/questions/31016156/how-to-extract-complex-json-structures-using-apache-spark-1-4-0-data-frames

 Thanks all!
 - gustavo




Re: Help optimising Spark SQL query

2015-06-22 Thread Yin Huai
Hi James,

Maybe it's the DISTINCT causing the issue.

I rewrote the query as follows. Maybe this one can finish faster.

select
  sum(cnt) as uses,
  count(id) as users
from (
  select
count(*) cnt,
    cast(id as string) as id
  from usage_events
  where
from_unixtime(cast(timestamp_millis/1000 as bigint)) between
'2015-06-09' and '2015-06-16'
  group by cast(id as string)
) tmp

Thanks,

Yin

On Mon, Jun 22, 2015 at 12:55 PM, Jörn Franke jornfra...@gmail.com wrote:

 Generally (not only spark sql specific) you should not cast in the where
 part of a sql query. It is also not necessary in your case. Getting rid of
 casts in the whole query will be also beneficial.

  On Mon, 22 Jun 2015 at 17:29, James Aley james.a...@swiftkey.com
  wrote:

 Hello,

 A colleague of mine ran the following Spark SQL query:

 select
   count(*) as uses,
   count (distinct cast(id as string)) as users
 from usage_events
 where
   from_unixtime(cast(timestamp_millis/1000 as bigint))
 between '2015-06-09' and '2015-06-16'

 The table contains billions of rows, but totals only 64GB of data across
 ~30 separate files, which are stored as Parquet with LZO compression in S3.

 From the referenced columns:

 * id is Binary, which we cast to a String so that we can DISTINCT by it.
 (I was already told this will improve in a later release, in a separate
 thread.)
 * timestamp_millis is a long, containing a unix timestamp with
 millisecond resolution

 This took nearly 2 hours to run on a 5 node cluster of r3.xlarge EC2
 instances, using 20 executors, each with 4GB memory. I can see from
 monitoring tools that the CPU usage is at 100% on all nodes, but incoming
 network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound.

 Does that seem slow? Can anyone offer any ideas by glancing at the query
 as to why this might be slow? We'll profile it meanwhile and post back if
 we find anything ourselves.

 A side issue - I've found that this query, and others, sometimes
 completes but doesn't return any results. There appears to be no error that
 I can see in the logs, and Spark reports the job as successful, but the
 connected JDBC client (SQLWorkbenchJ in this case), just sits there forever
 waiting. I did a quick Google and couldn't find anyone else having similar
 issues.


 Many thanks,

 James.




Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
btw, the user list will be a better place for this thread.

On Thu, Jun 18, 2015 at 8:19 AM, Yin Huai yh...@databricks.com wrote:

 Is it the full stack trace?

 On Thu, Jun 18, 2015 at 6:39 AM, Sea 261810...@qq.com wrote:

 Hi, all:

 I want to run spark sql on yarn(yarn-client), but ... I already set
 spark.yarn.jar and  spark.jars in conf/spark-defaults.conf.

  ./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 > game.txt

 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/deploy/yarn/ExecutorLauncher
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.deploy.yarn.ExecutorLauncher
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 Could not find the main class:
 org.apache.spark.deploy.yarn.ExecutorLauncher.  Program will exit.


 Anyone can help?





Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
Are you writing to an existing hive orc table?

On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian lian.cs@gmail.com wrote:

  Thanks for reporting this. Would you mind helping to create a JIRA for this?


 On 6/16/15 2:25 AM, patcharee wrote:

 I found if I move the partitioned columns in schemaString and in Row to
 the end of the sequence, then it works correctly...

 On 16. juni 2015 11:14, patcharee wrote:

 Hi,

 I am using spark 1.4 and HiveContext to append data into a partitioned
  hive table. I found that the data inserted into the table is correct, but the
 partition(folder) created is totally wrong.
 Below is my code snippet

 ---

 val schemaString = "zone z year month date hh x y height u v w ph phb t
 p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
 val schema =
   StructType(
 schemaString.split(" ").map(fieldName =>
   if (fieldName.equals("zone") || fieldName.equals("z") ||
 fieldName.equals("year") || fieldName.equals("month") ||
   fieldName.equals("date") || fieldName.equals("hh") ||
 fieldName.equals("x") || fieldName.equals("y"))
 StructField(fieldName, IntegerType, true)
   else
 StructField(fieldName, FloatType, true)
 ))

 val pairVarRDD =
 sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),

 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),

 0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
 ))

 val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)

 partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")

 .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark")


 ---


 The table contains 23 columns (longer than Tuple maximum length), so I
 use Row Object to store raw data, not Tuple.
 Here is some message from spark when it saved data

 15/06/16 10:39:22 INFO metadata.Hive: Renaming
 src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest:
 hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true

 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
 hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0
 with partSpec {zone=13195, z=0, year=0, month=0}

 From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = 3.
 But spark created a partition {zone=13195, z=0, year=0, month=0}.

 When I queried from hive

 hive select * from test4dimBySpark;
 OK
 242200931.00.0218.0365.09989.497
 29.62711319.0717930.11982734-3174.681297735.2 16.389032
 -96.6289125135.3652.6476808E-50.0 13195000
 hive select zone, z, year, month from test4dimBySpark;
 OK
 13195000
 hive dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
 Found 2 items
 -rw-r--r--   3 patcharee hdfs   1411 2015-06-16 10:39
 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1

 The data stored in the table is correct zone = 2, z = 42, year = 2009,
 month = 3, but the partition created was wrong
 zone=13195/z=0/year=0/month=0

 Is this a bug or what could be wrong? Any suggestion is appreciated.

 BR,
 Patcharee







 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
If you are writing to an existing Hive table, our insert into operator
follows Hive's requirement, which is:
*the dynamic partition columns must be specified last among the columns in
the SELECT statement and in the same order in which they appear in the
PARTITION() clause*.

You can find requirement in
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions.

If you use select to reorder the columns, I think it should work. Also, since
the table is an existing Hive table, you do not need to specify the format,
because we will use the format of the existing table.
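
For example, something like the following (an illustrative, untested sketch
that reuses the column names from your snippet below):

import org.apache.spark.sql.functions.col

// Put the dynamic partition columns (zone, z, year, month) last, in the same
// order as the table's partition spec, before inserting.
val dataCols = Seq("date", "hh", "x", "y", "height", "u", "v", "w", "ph", "phb",
  "t", "p", "pb", "qvapor", "qgraup", "qnice", "qnrain", "tke_pbl", "el_pbl")
val partCols = Seq("zone", "z", "year", "month")

partitionedTestDF2
  .select((dataCols ++ partCols).map(col): _*)
  .write
  .mode(org.apache.spark.sql.SaveMode.Append)
  .insertInto("test4DimBySpark")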

btw, please feel free to open a jira about removing this requirement for
inserting into an existing hive table.

Thanks,

Yin

On Thu, Jun 18, 2015 at 9:39 PM, Yin Huai yh...@databricks.com wrote:

 Are you writing to an existing hive orc table?

 On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian lian.cs@gmail.com wrote:

  Thanks for reporting this. Would you mind helping to create a JIRA for
  this?


 On 6/16/15 2:25 AM, patcharee wrote:

 I found if I move the partitioned columns in schemaString and in Row to
 the end of the sequence, then it works correctly...

 On 16. juni 2015 11:14, patcharee wrote:

 Hi,

 I am using spark 1.4 and HiveContext to append data into a partitioned
  hive table. I found that the data inserted into the table is correct, but the
 partition(folder) created is totally wrong.
 Below is my code snippet

 ---

 val schemaString = "zone z year month date hh x y height u v w ph phb t
 p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
 val schema =
   StructType(
 schemaString.split(" ").map(fieldName =>
   if (fieldName.equals("zone") || fieldName.equals("z") ||
 fieldName.equals("year") || fieldName.equals("month") ||
   fieldName.equals("date") || fieldName.equals("hh") ||
 fieldName.equals("x") || fieldName.equals("y"))
 StructField(fieldName, IntegerType, true)
   else
 StructField(fieldName, FloatType, true)
 ))

 val pairVarRDD =
 sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),

 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),

 0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
 ))

 val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)

 partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")

 .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark")


 ---


 The table contains 23 columns (longer than Tuple maximum length), so I
 use Row Object to store raw data, not Tuple.
 Here is some message from spark when it saved data

 15/06/16 10:39:22 INFO metadata.Hive: Renaming
 src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest:
 hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true

 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
 hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0
 with partSpec {zone=13195, z=0, year=0, month=0}

 From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month =
 3. But spark created a partition {zone=13195, z=0, year=0, month=0}.

 When I queried from hive

 hive select * from test4dimBySpark;
 OK
 242200931.00.0218.0365.09989.497
 29.62711319.0717930.11982734-3174.681297735.2 16.389032
 -96.6289125135.3652.6476808E-50.0 13195000
 hive select zone, z, year, month from test4dimBySpark;
 OK
 13195000
 hive dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
 Found 2 items
 -rw-r--r--   3 patcharee hdfs   1411 2015-06-16 10:39
 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1

 The data stored in the table is correct zone = 2, z = 42, year = 2009,
 month = 3, but the partition created was wrong
 zone=13195/z=0/year=0/month=0

 Is this a bug or what could be wrong? Any suggestion is appreciated.

 BR,
 Patcharee







 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



 -
 To unsubscribe, e-mail

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Yin Huai
Hi John,

Did you also set spark.sql.planner.externalSort to true? You probably will
not see executors lost with this conf. For now, maybe you can manually split
the query into two parts, one for the skewed keys and one for the other records.
Then, you union the results of these two parts together.
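
For illustration only (the table, column, and key names below are placeholders,
not taken from the actual query):

// Handle the few heavily skewed keys separately, then union the two results.
val hotKeys = Seq("key_a", "key_b")
val inList  = hotKeys.map(k => s"'$k'").mkString(", ")

val skewedPart = sqlContext.sql(
  s"SELECT * FROM events e JOIN dims d ON e.key = d.key WHERE e.key IN ($inList)")
val otherPart = sqlContext.sql(
  s"SELECT * FROM events e JOIN dims d ON e.key = d.key WHERE e.key NOT IN ($inList)")

val joined = skewedPart.unionAll(otherPart)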

Thanks,

Yin

On Wed, Jun 17, 2015 at 9:53 AM, Koert Kuipers ko...@tresata.com wrote:

 could it be composed maybe? a general version and then a sql version that
 exploits the additional info/abilities available there and uses the general
 version internally...

 i assume the sql version can benefit from the logical phase optimization
 to pick join details. or is there more?

 On Tue, Jun 16, 2015 at 7:37 PM, Michael Armbrust mich...@databricks.com
 wrote:

 this would be a great addition to spark, and ideally it belongs in spark
 core not sql.


 I agree with the fact that this would be a great addition, but we would
 likely want a specialized SQL implementation for performance reasons.





Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Yin Huai
So, the second attempt of those tasks that failed with the NPE can complete,
and the job eventually finished?

On Mon, Jun 15, 2015 at 10:37 PM, Night Wolf nightwolf...@gmail.com wrote:

 Hey Yin,

 Thanks for the link to the JIRA. I'll add details to it. But I'm able to
 reproduce it, at least in the same shell session: every time I do a write, I
 get a random number of tasks failing on the first run with the NPE.

 Using dynamic allocation of executors in YARN mode. No speculative
 execution is enabled.

 On Tue, Jun 16, 2015 at 3:11 PM, Yin Huai yh...@databricks.com wrote:

 I saw it once but I was not clear how to reproduce it. The jira I created
 is https://issues.apache.org/jira/browse/SPARK-7837.

 More information will be very helpful. Were those errors from speculative
 tasks or regular tasks (the first attempt of the task)? Is this error
 deterministic (can you reproduce every time you run this command)?

 Thanks,

 Yin

 On Mon, Jun 15, 2015 at 8:59 PM, Night Wolf nightwolf...@gmail.com
 wrote:

 Looking at the logs of the executor, looks like it fails to find the
 file; e.g. for task 10323.0


 15/06/16 13:43:13 ERROR output.FileOutputCommitter: Hit IOException
 trying to rename
 maprfs:///user/hive/warehouse/is_20150617_test2/_temporary/_attempt_201506161340__m_010181_0/part-r-353626.gz.parquet
 to maprfs:/user/hive/warehouse/is_20150617_test2/part-r-353626.gz.parquet
 java.io.IOException: Invalid source or target
 at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
 at
 org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
 at
 org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
 at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
 $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/06/16 13:43:13 ERROR mapred.SparkHadoopMapRedUtil: Error committing
 the output of task: attempt_201506161340__m_010181_0
 java.io.IOException: Invalid source or target
 at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
 at
 org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
 at
 org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
 at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
 $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/06/16 13:43:16 ERROR output.FileOutputCommitter: Hit IOException
 trying to rename
 maprfs:///user/hive/warehouse/is_20150617_test2/_temporary/_attempt_201506161341__m_010323_0/part-r-353768.gz.parquet
 to maprfs:/user/hive/warehouse/is_20150617_test2/part-r-353768.gz.parquet
 java.io.IOException: Invalid

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Yin Huai
I saw it once but I was not clear how to reproduce it. The jira I created
is https://issues.apache.org/jira/browse/SPARK-7837.

More information will be very helpful. Were those errors from speculative
tasks or regular tasks (the first attempt of the task)? Is this error
deterministic (can you reproduce every time you run this command)?

Thanks,

Yin

On Mon, Jun 15, 2015 at 8:59 PM, Night Wolf nightwolf...@gmail.com wrote:

 Looking at the logs of the executor, looks like it fails to find the file;
 e.g. for task 10323.0


 15/06/16 13:43:13 ERROR output.FileOutputCommitter: Hit IOException trying
 to rename
 maprfs:///user/hive/warehouse/is_20150617_test2/_temporary/_attempt_201506161340__m_010181_0/part-r-353626.gz.parquet
 to maprfs:/user/hive/warehouse/is_20150617_test2/part-r-353626.gz.parquet
 java.io.IOException: Invalid source or target
 at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
 at
 org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
 at
 org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
 at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
 $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/06/16 13:43:13 ERROR mapred.SparkHadoopMapRedUtil: Error committing the
 output of task: attempt_201506161340__m_010181_0
 java.io.IOException: Invalid source or target
 at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:137)
 at
 org.apache.spark.sql.sources.BaseWriterContainer.commitTask(commands.scala:357)
 at
 org.apache.spark.sql.sources.DefaultWriterContainer.commitTask(commands.scala:394)
 at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
 $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:157)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/06/16 13:43:16 ERROR output.FileOutputCommitter: Hit IOException trying
 to rename
 maprfs:///user/hive/warehouse/is_20150617_test2/_temporary/_attempt_201506161341__m_010323_0/part-r-353768.gz.parquet
 to maprfs:/user/hive/warehouse/is_20150617_test2/part-r-353768.gz.parquet
 java.io.IOException: Invalid source or target
 at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:952)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:201)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:225)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:100)
 at
 

Re: Issues with `when` in Column class

2015-06-12 Thread Yin Huai
Hi Chris,

Have you imported org.apache.spark.sql.functions._?
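For example (df and someCol here are placeholders):

import org.apache.spark.sql.functions._

// when() from the functions object starts the expression; the Column.when()
// method can only chain onto a Column already produced by when().
val flagged = df.withColumn("flag", when(col("someCol") === "abc", 1).otherwise(0))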

Thanks,

Yin

On Fri, Jun 12, 2015 at 8:05 AM, Chris Freeman cfree...@alteryx.com wrote:

  I’m trying to iterate through a list of Columns and create new Columns
 based on a condition. However, the when method keeps giving me errors that
 don’t quite make sense.

  If I do `when(col === “abc”, 1).otherwise(0)` I get the following error
 at compile time:

  [error] not found: value when

  However, this works in the REPL just fine after I import
 org.apache.spark.sql.Column.

  On the other hand, if I do `col.when(col === “abc”, 1).otherwise(0)`, it
 will compile successfully, but then at runtime, I get this error:

  java.lang.IllegalArgumentException: when() can only be applied on a
 Column previously generated by when() function

  This appears to be pretty circular logic. How can `when` only be applied
 to a Column previously generated by `when`? How would I call `when` in the
 first place?



Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-05 Thread Yin Huai
Hi Doug,

For now, I think you can use sqlContext.sql("USE databaseName") to change
the current database.
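For example (database and table names are placeholders):

sqlContext.sql("USE my_database")
val df = sqlContext.table("my_table")  // now resolved against my_database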

Thanks,

Yin

On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai yh...@databricks.com wrote:

 Hi Doug,

  sqlContext.table does not officially support a database name. It only
  supports a table name as the parameter. We will add a method to support
  database names in the future.

 Thanks,

 Yin

 On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog doug.sparku...@dugos.com
 wrote:

 Hi Yin,
  I’m very surprised to hear that its not supported in 1.3 because I’ve
 been using it since 1.3.0.
 It worked great up until  SPARK-6908 was merged into master.

 What is the supported way to get  DF for a table that is not in the
 default database ?

 IMHO, If you are not going to support “databaseName.tableName”,
 sqlContext.table() should have a version that takes a database and a table,
 ie

 def table(databaseName: String, tableName: String): DataFrame =
   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))

 The handling of databases in Spark(sqlContext, hiveContext, Catalog)
 could be better.

 Thanks,

 Doug

  On Jun 3, 2015, at 8:21 PM, Yin Huai yh...@databricks.com wrote:
 
  Hi Doug,
 
  Actually, sqlContext.table does not support database name in both Spark
 1.3 and Spark 1.4. We will support it in future version.
 
  Thanks,
 
  Yin
 
 
 
  On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog doug.sparku...@dugos.com
 wrote:
  Hi,
 
  sqlContext.table(“db.tbl”) isn’t working for me, I get a
 NoSuchTableException.
 
  But I can access the table via
 
  sqlContext.sql(“select * from db.tbl”)
 
  So I know it has the table info from the metastore.
 
  Anyone else see this ?
 
  I’ll keep digging.
  I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive
 -Phive-thriftserver
  It worked for me in 1.3.1
 
  Cheers,
 
  Doug
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 





Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-04 Thread Yin Huai
Are you using RC4?

On Wed, Jun 3, 2015 at 10:58 PM, Night Wolf nightwolf...@gmail.com wrote:

 Thanks Yin, that seems to work with the Shell. But on a compiled
 application with Spark-submit it still fails with the same exception.

 On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai yh...@databricks.com wrote:

 Can you put the following setting in spark-defaults.conf and try again?

 spark.sql.hive.metastore.sharedPrefixes
 com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni

 https://issues.apache.org/jira/browse/SPARK-7819 has more context about
 it.

 On Wed, Jun 3, 2015 at 9:38 PM, Night Wolf nightwolf...@gmail.com
 wrote:

 Hi all,

 Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client
 mode with Hive support.

 *Build command;*
 ./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn
 -Pmapr4 -Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided
 -Dhadoop.version=2.5.1-mapr-1501 -Dyarn.version=2.5.1-mapr-1501 -DskipTests
 -e -X


 When trying to run a hive query in the spark shell *sqlContext.sql("show
 tables")* I get the following exception;

 scala> sqlContext.sql("show tables")
 15/06/04 04:33:16 INFO hive.HiveContext: Initializing
 HiveMetastoreConnection version 0.13.1 using Spark classes.
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at com.mapr.fs.ShimLoader.loadNativeLibrary(ShimLoader.java:323)
 at com.mapr.fs.ShimLoader.load(ShimLoader.java:198)
 at
 org.apache.hadoop.conf.CoreDefaultProperties.clinit(CoreDefaultProperties.java:59)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:274)
 at
 org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1857)
 at
 org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2072)
 at
 org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2282)
 at
 org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2234)
 at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2151)
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:1002)
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:974)
 at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
 at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)
 at org.apache.hadoop.mapred.JobConf.init(JobConf.java:430)
 at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:1366)
 at org.apache.hadoop.hive.conf.HiveConf.init(HiveConf.java:1332)
 at
 org.apache.spark.sql.hive.client.ClientWrapper.init(ClientWrapper.scala:99)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at
 org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:170)
 at
 org.apache.spark.sql.hive.client.IsolatedClientLoader.init(IsolatedClientLoader.scala:166)
 at
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:212)
 at
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
 at
 org.apache.spark.sql.hive.HiveContext$$anon$2.init(HiveContext.scala:370)
 at
 org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:370)
 at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:369)
 at
 org.apache.spark.sql.hive.HiveContext$$anon$1.init(HiveContext.scala:382)
 at
 org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:382)
 at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:381)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:901)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:131)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
 at
 $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:21)
 at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
 at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:28)
 at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:30)
 at $line37.$read$$iwC$$iwC$$iwC$$iwC.init(console:32)
 at $line37.$read$$iwC$$iwC$$iwC.init(console:34)
 at $line37.$read$$iwC$$iwC.init(console:36)
 at $line37.$read$$iwC.init(console:38)
 at $line37.$read.init(console:40)
 at $line37.$read$.init(console:44)
 at $line37.$read$.clinit(console)
 at $line37.$eval$.init(console:7)
 at $line37.$eval$.clinit

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-04 Thread Yin Huai
Hi Doug,

sqlContext.table does not officially support a database name. It only
supports a table name as the parameter. We will add a method to support
database names in the future.

Thanks,

Yin

On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog doug.sparku...@dugos.com wrote:

 Hi Yin,
  I’m very surprised to hear that its not supported in 1.3 because I’ve
 been using it since 1.3.0.
 It worked great up until  SPARK-6908 was merged into master.

 What is the supported way to get  DF for a table that is not in the
 default database ?

 IMHO, If you are not going to support “databaseName.tableName”,
 sqlContext.table() should have a version that takes a database and a table,
 ie

 def table(databaseName: String, tableName: String): DataFrame =
   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))

 The handling of databases in Spark(sqlContext, hiveContext, Catalog) could
 be better.

 Thanks,

 Doug

  On Jun 3, 2015, at 8:21 PM, Yin Huai yh...@databricks.com wrote:
 
  Hi Doug,
 
  Actually, sqlContext.table does not support database name in both Spark
 1.3 and Spark 1.4. We will support it in future version.
 
  Thanks,
 
  Yin
 
 
 
  On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog doug.sparku...@dugos.com
 wrote:
  Hi,
 
  sqlContext.table(“db.tbl”) isn’t working for me, I get a
 NoSuchTableException.
 
  But I can access the table via
 
  sqlContext.sql(“select * from db.tbl”)
 
  So I know it has the table info from the metastore.
 
  Anyone else see this ?
 
  I’ll keep digging.
  I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive
 -Phive-thriftserver
  It worked for me in 1.3.1
 
  Cheers,
 
  Doug
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 




Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-03 Thread Yin Huai
Hi Doug,

Actually, sqlContext.table does not support a database name in either Spark 1.3
or Spark 1.4. We will support it in a future version.

Thanks,

Yin



On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog doug.sparku...@dugos.com
wrote:

 Hi,

 sqlContext.table(“db.tbl”) isn’t working for me, I get a
 NoSuchTableException.

 But I can access the table via

 sqlContext.sql(“select * from db.tbl”)

 So I know it has the table info from the metastore.

 Anyone else see this ?

 I’ll keep digging.
 I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive
 -Phive-thriftserver
 It worked for me in 1.3.1

 Cheers,

 Doug


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Yin Huai
Does it happen every time you read a parquet source?

On Tue, Jun 2, 2015 at 3:42 AM, Anders Arpteg arp...@spotify.com wrote:

 The log is from the log aggregation tool (hortonworks, yarn logs ...),
 so both executors and driver. I'll send a private mail to you with the full
 logs. Also, tried another job as you suggested, and it actually worked
 fine. The first job was reading from a parquet source, and the second from
 an avro source. Could there be some issues with the parquet reader?

 Thanks,
 Anders

 On Tue, Jun 2, 2015 at 11:53 AM, Shixiong Zhu zsxw...@gmail.com wrote:

 How about other jobs? Is it an executor log, or a driver log? Could you
 post other logs near this error, please? Thank you.

 Best Regards,
 Shixiong Zhu

 2015-06-02 17:11 GMT+08:00 Anders Arpteg arp...@spotify.com:

 Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that
 worked fine for Spark 1.3. The job starts on the cluster (yarn-cluster
 mode), initial stage starts, but the job fails before any task succeeds
 with the following error. Any hints?

 [ERROR] [06/02/2015 09:05:36.962] [Executor task launch worker-0]
 [akka.tcp://sparkDriver@10.254.6.15:33986/user/CoarseGrainedScheduler]
 swallowing exception during message send
 (akka.remote.RemoteTransportExceptionNoStackTrace)
 Exception in thread main akka.actor.ActorNotFound: Actor not found
 for: ActorSelection[Anchor(akka.tcp://sparkDriver@10.254.6.15:33986/),
 Path(/user/OutputCommitCoordinator)]
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at
 akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
 at
 akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
 at
 scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
 at
 scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
 at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
 at
 akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
 at
 akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
 at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
 at
 akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
 at
 akka.remote.EndpointManager$$anonfun$1.applyOrElse(Remoting.scala:575)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)






Re: Spark 1.3.0 - 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-30 Thread Yin Huai
Looks like your program somehow picked up an older version of Parquet (Spark
1.3.1 uses Parquet 1.6.0rc3, and it seems the NO_FILTER field was introduced in
1.6.0rc2). Is it possible for you to check the Parquet lib version in
your classpath?
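
One quick way to check which jar is actually loaded (a sketch; FilterCompat is
the class that carries NO_FILTER, and the package name is an assumption for the
pre-1.7 Parquet artifacts):

  // Print the location of the jar providing the Parquet filter2 classes
  println(classOf[parquet.filter2.compat.FilterCompat]
    .getProtectionDomain.getCodeSource.getLocation)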

Thanks,

Yin

On Sat, May 30, 2015 at 2:44 PM, ogoh oke...@gmail.com wrote:


  I had the same issue on AWS EMR with Spark 1.3.1.e (the AWS version) when it
  was installed with the '-h' parameter (a bootstrap action parameter for Spark).
  I don't see the problem with Spark 1.3.1.e when not passing that parameter.
 I am not sure about your env.
 Thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-0-1-3-1-produces-java-lang-NoSuchFieldError-NO-FILTER-tp22897p23090.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: dataframe cumulative sum

2015-05-29 Thread Yin Huai
Hi Cesar,

We just added it in Spark 1.4.

In Spark 1.4, you can use a window function in HiveContext to do it. Assuming
you want to calculate the cumulative sum for every flag:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df.select(
  $"flag",
  $"price",
  sum($"price").over(
    Window.partitionBy("flag").orderBy("price").rowsBetween(Long.MinValue, 0)))


In the code, over lets Spark SQL know that you want to use the window function
sum. partitionBy("flag") will partition the table by the value of flag, and
the sum's scope is a single partition. orderBy("price") will sort rows in a
partition based on the value of price (probably this does not really matter
for your case, but using orderBy will make the result deterministic).
Finally, rowsBetween(Long.MinValue, 0) means that the sum value for every
row is calculated from the price values of the first row in the partition to
the current row (so you get the cumulative sum).
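
Putting it together, a minimal sketch (assuming a HiveContext named sqlContext,
its implicits imported, and a DataFrame df with flag and price columns):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._
  import sqlContext.implicits._

  // Running sum of price within each flag group, ordered by price
  val w = Window.partitionBy("flag").orderBy("price")
    .rowsBetween(Long.MinValue, 0)
  val withCumSum = df.select(
    $"flag", $"price", sum($"price").over(w).as("cumsum_price"))
  withCumSum.show()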

Thanks,

Yin

On Fri, May 29, 2015 at 8:09 AM, Cesar Flores ces...@gmail.com wrote:

 What will be the more appropriate method to add a cumulative sum column to
 a data frame. For example, assuming that I have the next data frame:

  flag | price
  -----|-----------------
  1    | 47.808764653746
  1    | 47.808764653746
  1    | 31.9869279512204


 How can I create a data frame with an extra cumsum column as the next one:

  flag | price            | cumsum_price
  -----|------------------|-----------------
  1    | 47.808764653746  | 47.808764653746
  1    | 47.808764653746  | 95.6175293075
  1    | 31.9869279512204 | 127.604457259


 Thanks
 --
 Cesar Flores



Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread Yin Huai
Can you attach the full stack trace?

Thanks,

Yin

On Fri, May 8, 2015 at 4:44 PM, barmaley o...@solver.com wrote:

  Given a registered table from a data frame, I'm able to execute queries like
  sqlContext.sql("SELECT STDDEV(col1) FROM table") from the Spark shell just
  fine.
  However, when I run exactly the same code in a standalone app on a cluster,
  it throws an exception: java.util.NoSuchElementException: key not found:
  STDDEV...

  Is STDDEV among the default functions in Spark SQL? I'd appreciate it if you
  could comment on what's going on with the above.

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-STDDEV-working-in-Spark-Shell-but-not-in-a-standalone-app-tp22825.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
oh, I missed that. It is fixed in 1.3.0.

Also, Jianshi, the dataset was not generated by Spark SQL, right?

On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu yuzhih...@gmail.com wrote:

 Yin:
 Fix Version of SPARK-4520 is not set.
 I assume it was fixed in 1.3.0

 Cheers

 On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai yh...@databricks.com wrote:

 The exception looks like the one mentioned in
 https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
 Spark?

 On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 My data looks like this:

  +-----------+----------------------------+----------+
  | col_name  | data_type                  | comment  |
  +-----------+----------------------------+----------+
  | cust_id   | string                     |          |
  | part_num  | int                        |          |
  | ip_list   | array<struct<ip:string>>   |          |
  | vid_list  | array<struct<vid:string>>  |          |
  | fso_list  | array<struct<fso:string>>  |          |
  | src       | string                     |          |
  | date      | int                        |          |
  +-----------+----------------------------+----------+

  And when I did select *, it reported a ParquetDecodingException.

  Is this type not supported in Spark SQL?

 Detailed error message here:


 Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in 
 stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com): 
 parquet.io.ParquetDecodingException:
 Can not read value at 0 in block -1 in file 
 hdfs://xxx/part-m-0.gz.parquet
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
 at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
 at 
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:400)
 at java.util.ArrayList.get(ArrayList.java:413)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
 at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
 at 
 parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:290)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
 at 
 parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
 at 
 parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
 at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126

Re: Convert DStream to DataFrame

2015-04-24 Thread Yin Huai
Hi Sergio,

I missed this thread somehow... For the error "case classes cannot have
more than 22 parameters", it is a limitation of Scala (see
https://issues.scala-lang.org/browse/SI-7296). You can follow the
instructions at
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
to create a table with more than 22 columns. Basically, you first create an
RDD[Row] and a schema of the table represented by a StructType. Then, you use
createDataFrame to apply the schema.
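
A minimal sketch of that approach (column names and the input RDD are
illustrative; assume rdd is an RDD[String] of comma-separated records):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StructField, StringType, StructType}

  // Build a schema with more than 22 columns programmatically
  val schema = StructType(
    (1 to 30).map(i => StructField(s"col$i", StringType, nullable = true)))
  // Turn each input line into a Row with the same arity as the schema
  val rowRDD = rdd.map(line => Row.fromSeq(line.split(",").toSeq))
  val wideDF = sqlContext.createDataFrame(rowRDD, schema)
  wideDF.registerTempTable("wide_table")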

Thanks,

Yin

On Fri, Apr 24, 2015 at 10:44 AM, Sergio Jiménez Barrio 
drarse.a...@gmail.com wrote:

 Solved! I have solved the problem combining both solutions. The result is
 this:

 messages.foreachRDD { rdd =>
   val message: RDD[String] = rdd.map { y => y._2 }
   val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
   import sqlContext.implicits._
   val df: DataFrame = sqlContext.jsonRDD(message).toDF()
   df.groupBy("classification").count().show()
   println()
 }
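
 One refinement worth noting (a sketch, not part of the original solution):
 jsonRDD needs data to infer a schema, so guarding against empty batches, as
 Tathagata points out further down the thread, could look like this:

 messages.foreachRDD { rdd =>
   val message = rdd.map(_._2)
   if (!message.isEmpty()) {  // skip batches with no data; schema inference would fail
     val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
     import sqlContext.implicits._
     val df = sqlContext.jsonRDD(message).toDF()
     df.groupBy("classification").count().show()
   }
 }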





  The SQLContextSingleton is the helper function from the Spark documentation.
  Thanks to all!



 2015-04-23 10:29 GMT+02:00 Sergio Jiménez Barrio drarse.a...@gmail.com:

 Thank you ver much, Tathagata!


 El miércoles, 22 de abril de 2015, Tathagata Das t...@databricks.com
 escribió:

 Aaah, that. That is probably a limitation of the SQLContext (cc'ing Yin
 for more information).


 On Wed, Apr 22, 2015 at 7:07 AM, Sergio Jiménez Barrio 
 drarse.a...@gmail.com wrote:

 Sorry, this is the error:

 [error] /home/sergio/Escritorio/hello/streaming.scala:77:
 Implementation restriction: case classes cannot have more than 22
 parameters.



 2015-04-22 16:06 GMT+02:00 Sergio Jiménez Barrio drarse.a...@gmail.com
 :

  I tried the solution from the guide, but I exceeded the size limit of the
  case class Row:


 2015-04-22 15:22 GMT+02:00 Tathagata Das tathagata.das1...@gmail.com
 :

 Did you checkout the latest streaming programming guide?


 http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations

  You also need to be aware that to convert JSON RDDs to a dataframe,
  sqlContext has to make a pass over the data to learn the schema. This will
  fail if a batch has no data. You have to safeguard against that.

 On Wed, Apr 22, 2015 at 6:19 AM, ayan guha guha.a...@gmail.com
 wrote:

 What about sqlcontext.createDataframe(rdd)?
 On 22 Apr 2015 23:04, Sergio Jiménez Barrio drarse.a...@gmail.com
 wrote:

 Hi,

 I am using Kafka with Apache Stream to send JSON to Apache Spark:

 val messages = KafkaUtils.createDirectStream[String, String, 
 StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

  Now, I want to parse the created DStream to a DataFrame, but I don't
  know if Spark 1.3 has some easy way for this. Any suggestions? I can get
  the message with:

 val lines = messages.map(_._2)

 Thank u for all. Sergio J.








 --
 Atte. Sergio Jiménez





Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
The exception looks like the one mentioned in
https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
Spark?

On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi,

 My data looks like this:

  +-----------+----------------------------+----------+
  | col_name  | data_type                  | comment  |
  +-----------+----------------------------+----------+
  | cust_id   | string                     |          |
  | part_num  | int                        |          |
  | ip_list   | array<struct<ip:string>>   |          |
  | vid_list  | array<struct<vid:string>>  |          |
  | fso_list  | array<struct<fso:string>>  |          |
  | src       | string                     |          |
  | date      | int                        |          |
  +-----------+----------------------------+----------+

  And when I did select *, it reported a ParquetDecodingException.

  Is this type not supported in Spark SQL?

 Detailed error message here:


 Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in 
 stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com): 
 parquet.io.ParquetDecodingException:
 Can not read value at 0 in block -1 in file hdfs://xxx/part-m-0.gz.parquet
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
 at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:400)
 at java.util.ArrayList.get(ArrayList.java:413)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
 at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
 at 
 parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:290)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
 at 
 parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
 at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
 at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Yin Huai
Hi Shuai,

You can use as to create a table alias. For example, df1.as("df1"). Then
you can use $"df1.col" to refer to it.
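
A short sketch of the alias approach (df1, df2, and the join column "col" are
the names from the question; assumes the SQLContext implicits are imported so
the $"..." syntax is available):

  import sqlContext.implicits._

  // Alias both sides, join on the shared column, then select it unambiguously
  val joined = df1.as("d1").join(df2.as("d2"), $"d1.col" === $"d2.col")
  val result = joined.select($"d1.col")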

Thanks,

Yin

On Thu, Apr 23, 2015 at 11:14 AM, Shuai Zheng szheng.c...@gmail.com wrote:

 Hi All,



 I use 1.3.1



  When I have two DFs and join them on a key with the same name, I can’t
  get the common key by name afterwards.



 Basically:

 select * from t1 inner join t2 on t1.col1 = t2.col1



 And I am using purely DataFrame, spark SqlContext not HiveContext



  DataFrame df3 = df1.join(df2, df1.col("col").equalTo(df2.col("col"))).select(
  "col");



 because df1 and df2 join on the same key col,



  Then I can't reference the key col. I understand I should use a fully
  qualified name for that column (like in SQL, use t1.col), but I don’t know
  how I should address this in Spark SQL.



 Exception in thread main org.apache.spark.sql.AnalysisException:
 Reference 'id' is ambiguous, could be: id#8L, id#0L.;



  It looks like the joined key can't be referenced by name or by the df1.col(name)
  pattern.

 The https://issues.apache.org/jira/browse/SPARK-5278 refer to a hive
 case, so I am not sure whether it is the same issue, but I still have the
 issue in latest code.



 It looks like the result after join won't keep the parent DF information
 anywhere?



 I check the ticket: https://issues.apache.org/jira/browse/SPARK-6273



 But not sure whether  it is the same issue? Should I open a new ticket for
 this?



 Regards,



 Shuai





Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Yin Huai
Xudong and Rex,

Can you try 1.3.1? With PR 5339 http://github.com/apache/spark/pull/5339 ,
after we get a Hive Parquet table from the metastore and convert it to our
native Parquet code path, we will cache the converted relation. For now, the
first access to that Hive Parquet table reads all of the footers (when you
first refer to that table in a query or call
sqlContext.table("hiveParquetTable")). All of your later accesses will hit
the metadata cache.
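
For reference, the fallback discussed later in this thread can be toggled like
this (a sketch; it applies to the 1.3.x line being discussed):

  // Fall back to the old, non data source API Parquet code path
  sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")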

Thanks,

Yin

On Tue, Apr 21, 2015 at 1:13 AM, Rex Xiong bycha...@gmail.com wrote:

 We have the similar issue with massive parquet files, Cheng Lian, could
 you have a look?

 2015-04-08 15:47 GMT+08:00 Zheng, Xudong dong...@gmail.com:

 Hi Cheng,

  I tried both of these patches, and it seems they still do not resolve my
  issue. I found that most of the time is spent on this line in newParquet.scala:

  ParquetFileReader.readAllFootersInParallel(
    sparkContext.hadoopConfiguration, seqAsJavaList(leaves),
    taskSideMetaData)

  This needs to read all the files under the Parquet folder, and our Parquet
  folder has a lot of Parquet files (near 2000); reading one file needs about 2
  seconds, so it becomes very slow ... And PR 5231 did not skip this step,
  so it does not resolve my issue.

  As our Parquet files are generated by a Spark job, the number of
  .parquet files is the same as the number of tasks; that is why we have so
  many files. But these files actually have the same schema. Is there any way
  to merge these files into one, or to avoid scanning each of them?

 On Sat, Apr 4, 2015 at 9:47 PM, Cheng Lian lian.cs@gmail.com wrote:

  Hey Xudong,

 We had been digging this issue for a while, and believe PR 5339
 http://github.com/apache/spark/pull/5339 and PR 5334
 http://github.com/apache/spark/pull/5339 should fix this issue.

  There are two problems:

  1. Normally we cache Parquet table metadata for better performance, but
  when converting Hive metastore Parquet tables, the cache is not used. Thus
  heavy operations like schema discovery are done every time a metastore
  Parquet table is converted.
 2. With Parquet task side metadata reading (which is turned on by
 default), we can actually skip the row group information in the footer.
 However, we accidentally called a Parquet function which doesn't skip row
 group information.

  For your question about schema merging, Parquet allows different
  part-files to have different but compatible schemas. For example,
  part-1.parquet has columns a and b, while part-2.parquet may have
  columns a and c. In some cases, the summary files (_metadata and
  _common_metadata) contain the merged schema (a, b, and c), but it's not
  guaranteed. For example, when the user-defined metadata stored in different
  part-files contains different values for the same key, Parquet simply gives
  up writing summary files. That's why all part-files must be touched to get
  a precise merged schema.

 However, in scenarios where a centralized arbitrative schema is
 available (e.g. Hive metastore schema, or the schema provided by user via
 data source DDL), we don't need to do schema merging on driver side, but
 defer it to executor side and each task only needs to reconcile those
 part-files it needs to touch. This is also what the Parquet developers did
 recently for parquet-hadoop
 https://github.com/apache/incubator-parquet-mr/pull/45.

 Cheng


 On 3/31/15 11:49 PM, Zheng, Xudong wrote:

 Thanks Cheng!

   Setting 'spark.sql.parquet.useDataSourceApi' to false resolves my issues,
  but the PR 5231 does not seem to. Not sure what other things I did wrong ...

   BTW, actually, we are very interested in the schema merging feature in
  Spark 1.3, so both of these two solutions will disable this feature, right? It
  seems that Parquet metadata is stored in a file named _metadata in the
  Parquet file folder (each folder is a partition, as we use a partitioned table),
  so why do we need to scan all Parquet part-files? Is there any other solution
  that could keep the schema merging feature at the same time? We really like
  this feature :)

 On Tue, Mar 31, 2015 at 3:19 PM, Cheng Lian lian.cs@gmail.com
 wrote:

  Hi Xudong,

 This is probably because of Parquet schema merging is turned on by
 default. This is generally useful for Parquet files with different but
 compatible schemas. But it needs to read metadata from all Parquet
 part-files. This can be problematic when reading Parquet files with lots of
 part-files, especially when the user doesn't need schema merging.

 This issue is tracked by SPARK-6575, and here is a PR for it:
 https://github.com/apache/spark/pull/5231. This PR adds a
 configuration to disable schema merging by default when doing Hive
 metastore Parquet table conversion.

 Another workaround is to fallback to the old Parquet code by setting
 spark.sql.parquet.useDataSourceApi to false.

 Cheng


 On 3/31/15 2:47 PM, Zheng, Xudong wrote:

 Hi all,

  We are using Parquet Hive tables, and we are upgrading to Spark 1.3.
  But we find that just a simple COUNT(*) query will be much slower (100x) than
  Spark 

Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar,

Can you try 1.3.1 (
https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it
still shows the error?

Thanks,

Yin

On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote:

 This is strange. cc the dev list since it might be a bug.



 On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:

 Never mind. I found the solution:

  val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
  hiveLoadedDataFrame.schema)

  which translates to converting the data frame to an RDD and back again to a
  data frame. Not the prettiest solution, but at least it solves my problems.


 Thanks,
 Cesar Flores



 On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores ces...@gmail.com wrote:


 I have a data frame in which I load data from a hive table. And my issue
 is that the data frame is missing the columns that I need to query.

 For example:

  val newdataset = dataset.where(dataset("label") === 1)

 gives me an error like the following:

 ERROR yarn.ApplicationMaster: User class threw exception: resolved
 attributes label missing from label, user_id, ...(the rest of the fields of
 my table
 org.apache.spark.sql.AnalysisException: resolved attributes label
 missing from label, user_id, ... (the rest of the fields of my table)

  where we can see that the label field actually exists. I managed to solve
  this issue by updating my syntax to:

  val newdataset = dataset.where($"label" === 1)

  which works. However, I cannot use this trick in all my queries. For
  example, when I try to do a unionAll from two subsets of the same data
  frame, the error I am getting is that all my fields are missing.

  Can someone tell me if I need to do some post-processing after loading
  from Hive in order to avoid this kind of error?


 Thanks
 --
 Cesar Flores




 --
 Cesar Flores





Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Can your code that can reproduce the problem?

On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda ar...@sigmoidanalytics.com
 wrote:

 Hi

 As per JIRA this issue is resolved, but i am still facing this issue.

 SPARK-2734 - DROP TABLE should also uncache table


 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com



Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Oh, just noticed that I missed attach... Yeah, your scripts will be
helpful. Thanks!

On Thu, Apr 16, 2015 at 12:03 PM, Arush Kharbanda 
ar...@sigmoidanalytics.com wrote:

 Yes, i am able to reproduce the problem. Do you need the scripts to create
 the tables?

 On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai yh...@databricks.com wrote:

 Can your code that can reproduce the problem?

 On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Hi

 As per JIRA this issue is resolved, but i am still facing this issue.

 SPARK-2734 - DROP TABLE should also uncache table


 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com





 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com



Re: Spark SQL query key/value in Map

2015-04-16 Thread Yin Huai
For a Map type column, fields['driver'] is the syntax to retrieve the map
value (in the schema, you can see fields: map). The fields.driver syntax
is used for struct types.
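
A small sketch of the two syntaxes side by side (table and field names follow
the question; the struct remark is hypothetical):

  // Map column: bracket syntax in SQL
  val byDriver = sqlContext.sql(
    "SELECT fields['driver'] FROM raw_device_data WHERE fields['driver'] = 'driver1'")
  // If fields were a struct instead, dot syntax would apply: fields.driver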

On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco jc.francisc...@gmail.com
wrote:

 Hi,

 I'm new with both Cassandra and Spark and am experimenting with what Spark
 SQL can do as it will affect my Cassandra data model.

 What I need is a model that can accept arbitrary fields, similar to
 Postgres's Hstore. Right now, I'm trying out the map type in Cassandra but
 I'm getting the exception below when running my Spark SQL:

 java.lang.RuntimeException: Can't access nested field in type
 MapType(StringType,StringType,true)

 The schema I have now is:
 root
  |-- device_id: integer (nullable = true)
  |-- event_date: string (nullable = true)
  |-- fields: map (nullable = true)
  ||-- key: string
  ||-- value: string (valueContainsNull = true)

 And my Spark SQL is:
 SELECT fields from raw_device_data where fields.driver = 'driver1'

 From what I gather, this should work for a JSON based RDD
 (
 https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
 ).

 Is this not supported for a Cassandra map type?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-query-key-value-in-Map-tp22517.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Yin Huai
Hi Neal,

spark.sql.shuffle.partitions is the property that controls the number of tasks
after a shuffle (to generate t2, there is a shuffle for the aggregations
specified by groupBy and agg). You can use
sqlContext.setConf("spark.sql.shuffle.partitions", "newNumber") or
sqlContext.sql("set spark.sql.shuffle.partitions=newNumber") to set the
value of it.

For t2.repartition(10).collect, because your t2 has 200 partitions, the
first stage has 200 tasks (for the second stage, there will be 10 tasks).
So, if you have something like val t3 = t2.repartition(10), t3 will have 10
partitions.
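
A short sketch combining both knobs (values are illustrative):

  import org.apache.spark.sql.functions.avg

  // Fewer post-shuffle tasks for the aggregation that builds t2
  sqlContext.setConf("spark.sql.shuffle.partitions", "10")
  val t2 = t1.groupBy("country", "state", "city").agg(avg("price").as("aprice"))
  // Explicitly repartition the result; t3 has 10 partitions
  val t3 = t2.repartition(10)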

Thanks,

Yin

On Thu, Apr 16, 2015 at 3:04 PM, Neal Yin neal@workday.com wrote:

  I have some trouble controlling the number of Spark tasks for a stage. This
 is on the latest Spark 1.3.x source code build.

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 sc.getConf.get("spark.default.parallelism")  — set to *10*
 val t1 = hiveContext.sql("FROM SalesJan2009 select *")
 val t2 = t1.groupBy("country", "state",
 "city").agg(avg("price").as("aprice"))

  t1.rdd.partitions.size   —  got 2
 t2.rdd.partitions.size  —  got 200

  First question, why does t2’s partition size become 200?

  Second question, even if I do t2.repartition(10).collect, in some
 stages it still fires 200 tasks.

  Thanks,

  -Neal






Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
Hi Justin,

Does the schema of your data have any decimal, array, map, or struct type?

Thanks,

Yin

On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io wrote:

 Hello,

  I have a parquet file of around 55M rows (~1G on disk). Performing a simple
  grouping operation is pretty efficient (I get results within 10 seconds).
  However, after calling DataFrame.cache, I observe a significant performance
  degradation; the same operation now takes 3+ minutes.

  My hunch is that the DataFrame cannot leverage its columnar format after
  persisting in memory, but I cannot find anything in the docs mentioning this.

  Did I miss anything?

 Thanks!

 Justin



Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Yin Huai
For cast, you can use the selectExpr method. For example,
df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2").
Or, df.select(df("colA").cast("int"), ...)
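
A small sketch with both forms (column names are illustrative; assumes df came
from the CSV package with all-string columns):

  // SQL-expression form
  val typed = df.selectExpr(
    "cast(col1 as int) as col1", "cast(col2 as bigint) as col2")
  // Column-based form
  val typed2 = df.select(
    df("col1").cast("int").as("col1"), df("col2").cast("bigint").as("col2"))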

On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust mich...@databricks.com
wrote:

  val df = Seq(("test", 1)).toDF("col1", "col2")

  You can use SQL style expressions as a string:

  df.filter("col1 IS NOT NULL").collect()
  res1: Array[org.apache.spark.sql.Row] = Array([test,1])

  Or you can also reference columns using df("colName"), $"colName", or
  col("colName")

  df.filter(df("col1") === "test").collect()
  res2: Array[org.apache.spark.sql.Row] = Array([test,1])

 On Thu, Apr 2, 2015 at 7:45 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Hi folks, having some seemingly noob issues with the dataframe API.

 I have a DF which came from the csv package.

 1. What would be an easy way to cast a column to a given type -- my DF
 columns are all typed as strings coming from a csv. I see a schema getter
 but not setter on DF

 2. I am trying to use the syntax used in various blog posts but can't
 figure out how to reference a column by name:

  scala> df.filter("customer_id" != "")
  <console>:23: error: overloaded method value filter with alternatives:
    (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
    (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
   cannot be applied to (Boolean)
    df.filter("customer_id" != "")

  3. What would be the recommended way to drop a row containing a null
  value -- is it possible to do this:
  scala> df.filter("customer_id IS NOT NULL")







Re: rdd.toDF().saveAsParquetFile(tachyon://host:19998/test)

2015-03-28 Thread Yin Huai
You are hitting https://issues.apache.org/jira/browse/SPARK-6330. It has
been fixed in 1.3.1, which will be released soon.

On Fri, Mar 27, 2015 at 10:42 PM, sud_self 852677...@qq.com wrote:

  Spark version is 1.3.0 with tachyon-0.6.1

  QUESTION DESCRIPTION: rdd.saveAsObjectFile("tachyon://host:19998/test") and
  rdd.saveAsTextFile("tachyon://host:19998/test") succeed, but
  rdd.toDF().saveAsParquetFile("tachyon://host:19998/test") fails.

 ERROR MESSAGE:java.lang.IllegalArgumentException: Wrong FS:
 tachyon://host:19998/test, expected: hdfs://host:8020
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
 at
 org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:465)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
 at

 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/rdd-toDF-saveAsParquetFile-tachyon-host-19998-test-tp22264.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Use pig load function in spark

2015-03-23 Thread Yin Huai
Hello Kevin,

You can take a look at our generic load function
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#generic-loadsave-functions
.

For example, you can use

val df = sqlContext.load("/myData", "parquet")

to load a Parquet dataset stored in /myData as a DataFrame
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframes.

You can use it to load data stored in various formats, like json (Spark
built-in), parquet (Spark built-in), avro
https://github.com/databricks/spark-avro, and csv
https://github.com/databricks/spark-csv.
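
A few illustrative calls (the external packages must be on the classpath; the
spark-csv options shown are an assumption based on that package's README):

  // Built-in sources
  val parquetDF = sqlContext.load("/myData", "parquet")
  val jsonDF = sqlContext.load("/myJson", "json")
  // External source, configured through an options map
  val csvDF = sqlContext.load("com.databricks.spark.csv",
    Map("path" -> "/myData.csv", "header" -> "true"))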

Thanks,

Yin

On Mon, Mar 23, 2015 at 7:14 PM, Dai, Kevin yun...@ebay.com wrote:

  Hi, Paul



 You are right.



  The story is that we have a lot of Pig load functions to load our different
  data.



  And now we want to use Spark to read and process this data.



  So we want to figure out a way to reuse our existing load functions in
  Spark to read this data.



 Any idea?



 Best Regards,

 Kevin.



 *From:* Paul Brown [mailto:p...@mult.ifario.us]
 *Sent:* 2015年3月24日 4:11
 *To:* Dai, Kevin
 *Subject:* Re: Use pig load function in spark





 The answer is Maybe, but you probably don't want to do that..



 A typical Pig load function is devoted to bridging external data into
 Pig's type system, but you don't really need to do that in Spark because it
 is (thankfully) not encumbered by Pig's type system.  What you probably
 want to do is to figure out a way to use native Spark facilities (e.g.,
 textFile) coupled with some of the logic out of your Pig load function
 necessary to turn your external data into an RDD.




   —
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/



 On Mon, Mar 23, 2015 at 2:29 AM, Dai, Kevin yun...@ebay.com wrote:

 Hi, all



 Can spark use pig’s load function to load data?



 Best Regards,

 Kevin.





Re: Date and decimal datatype not working

2015-03-23 Thread Yin Huai
To store to a CSV file, you can use the Spark-CSV
https://github.com/databricks/spark-csv library.
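
A minimal sketch (assuming the query result is a DataFrame named df and the
spark-csv package is on the classpath; the output path is illustrative):

  // Spark 1.3 style: save through the external data source by name
  df.save("/Myhome/SPARK/output_csv", "com.databricks.spark.csv")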

On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA ab9...@att.com wrote:

   Thanks. This worked well as per your suggestions. I had to run the following:

  val TABLE_A =
  sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("\\|")).map(p
  => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)),
  BigDecimal(p(5)), BigDecimal(p(6))))



  Now I am stuck at another step. I have run a SQL query where I am
  selecting all the fields with a where clause, TSTAMP filtered with
  a date range, and an order by TSTAMP clause. That is running fine.



  Then I am trying to store the output in a CSV file. I am using the
  saveAsTextFile("filename") function. But it is giving an error. Can you please
  help me with the proper syntax to store the output in a CSV file?





 Thanks  Regards

 ---

 Ananda Basak

 Ph: 425-213-7092



 *From:* BASAK, ANANDA
 *Sent:* Tuesday, March 17, 2015 3:08 PM
 *To:* Yin Huai
 *Cc:* user@spark.apache.org
 *Subject:* RE: Date and decimal datatype not working



 Ok, thanks for the suggestions. Let me try and will confirm all.



 Regards

 Ananda



 *From:* Yin Huai [mailto:yh...@databricks.com]
 *Sent:* Tuesday, March 17, 2015 3:04 PM
 *To:* BASAK, ANANDA
 *Cc:* user@spark.apache.org
 *Subject:* Re: Date and decimal datatype not working



 p(0) is a String. So, you need to explicitly convert it to a Long. e.g.
  p(0).trim.toLong. You also need to do it for p(2). For the BigDecimal
  values, you need to create BigDecimal objects from your String values.



 On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA ab9...@att.com wrote:

   Hi All,

  I am very new to the Spark world. I just started some test coding last
  week. I am using spark-1.2.1-bin-hadoop2.4 and Scala coding.

  I am having issues while using the Date and decimal data types. Following is
  my code, which I am simply running at the Scala prompt. I am trying to define a
  table and point it to my flat file containing raw data (pipe-delimited
  format). Once that is done, I will run some SQL queries and put the output
  data into another flat file with pipe-delimited format.
 data in to another flat file with pipe delimited format.



 ***

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 import sqlContext.createSchemaRDD





 // Define row and table

 case class ROW_A(

   TSTAMP:   Long,

   USIDAN: String,

   SECNT:Int,

   SECT:   String,

   BLOCK_NUM:BigDecimal,

   BLOCK_DEN:BigDecimal,

   BLOCK_PCT:BigDecimal)



  val TABLE_A =
  sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("\\|")).map(p
  => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))



  TABLE_A.registerTempTable("TABLE_A")



 ***



  The second to last command is giving an error like the following:

  <console>:17: error: type mismatch;

  found   : String

  required: Long



  Looks like the contents of my flat file are always treated as String
  and not as Date or decimal. How can I make Spark take them as Date or
  decimal types?



 Regards

 Ananda





Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yin Huai
spark.sql.shuffle.partitions only controls the number of tasks in the second
stage (the number of reducers). For your case, I'd say that the number of
tasks in the first stage (the number of mappers) will be the number of files
you have.

Actually, have you changed spark.executor.memory (it controls the memory
for an executor of your application)? I did not see it in your original
email. The difference between worker memory and executor memory can be
found at (http://spark.apache.org/docs/1.3.0/spark-standalone.html),

SPARK_WORKER_MEMORY
Total amount of memory to allow Spark applications to use on the machine,
e.g. 1000m, 2g (default: total memory minus 1 GB); note that each
application's individual memory is configured using its
spark.executor.memory property.
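
A short sketch of the per-application setting (values are illustrative):

  import org.apache.spark.SparkConf

  // spark.executor.memory is per application; SPARK_WORKER_MEMORY only caps
  // what a worker machine will hand out across all applications.
  val conf = new SparkConf()
    .set("spark.executor.memory", "8g")
    .set("spark.sql.shuffle.partitions", "400")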


On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas johngou...@gmail.com
wrote:

 Actually I realized that the correct way is:

  sqlContext.sql("set spark.sql.shuffle.partitions=1000")

 but I am still experiencing the same behavior/error.

 On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com wrote:

 Hi Yin,

 the way I set the configuration is:

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  sqlContext.setConf("spark.sql.shuffle.partitions", "1000")

  That is the correct way, right?
  In the mapPartitions task (the first task which is launched), I get
  the same number of tasks again, and again the same error. :(

 Thanks a lot!

 On 19 March 2015 at 17:40, Yiannis Gkoufas johngou...@gmail.com wrote:

 Hi Yin,

 thanks a lot for that! Will give it a shot and let you know.

 On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote:

 Was the OOM thrown during the execution of first stage (map) or the
 second stage (reduce)? If it was the second stage, can you increase the
 value of spark.sql.shuffle.partitions and see if the OOM disappears?

  This setting controls the number of reducers Spark SQL will use, and the
  default is 200. Maybe there are too many distinct values and the memory
 pressure on every task (of those 200 reducers) is pretty high. You can
 start with 400 and increase it until the OOM disappears. Hopefully this
 will help.

 Thanks,

 Yin


 On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 Thanks for your feedback. I have 1700 parquet files, sized 100MB each.
 The number of tasks launched is equal to the number of parquet files. Do
 you have any idea on how to deal with this situation?

 Thanks a lot
 On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote:

  Seems there are too many distinct groups processed in a task, which
  triggers the problem.

 How many files do your dataset have and how large is a file? Seems
 your query will be executed with two stages, table scan and map-side
 aggregation in the first stage and the final round of reduce-side
 aggregation in the second stage. Can you take a look at the numbers of
 tasks launched in these two stages?

 Thanks,

 Yin

 On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas 
 johngou...@gmail.com wrote:

 Hi there, I set the executor memory to 8g but it didn't help

 On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com wrote:

 You should probably increase executor memory by setting
 spark.executor.memory.

 Full list of available configurations can be found here
 http://spark.apache.org/docs/latest/configuration.html

 Cheng


 On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:

 Hi there,

 I was trying the new DataFrame API with some basic operations on a
 parquet dataset.
 I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in
 a standalone cluster mode.
 The code is the following:

  val people = sqlContext.parquetFile("/data.parquet");
  val res = people.groupBy("name","date").
  agg(sum("power"),sum("supply")).take(10);
  System.out.println(res);

 The dataset consists of 16 billion entries.
 The error I get is java.lang.OutOfMemoryError: GC overhead limit
 exceeded

 My configuration is:

 spark.serializer org.apache.spark.serializer.KryoSerializer
  spark.driver.memory 6g
  spark.executor.extraJavaOptions -XX:+UseCompressedOops
  spark.shuffle.manager sort

 Any idea how can I workaround this?

 Thanks a lot











Re: Spark SQL filter DataFrame by date?

2015-03-19 Thread Yin Huai
Can you add your code snippet? Seems it's missing in the original email.

Thanks,

Yin

On Thu, Mar 19, 2015 at 3:22 PM, kamatsuoka ken...@gmail.com wrote:

 I'm trying to filter a DataFrame by a date column, with no luck so far.
 Here's what I'm doing:



 When I run reqs_day.count() I get zero, apparently because my date
 parameter
 gets translated to 16509.

 Is this a bug, or am I doing it wrong?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-filter-DataFrame-by-date-tp22149.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
I meant table properties and serde properties are used to store metadata of
a Spark SQL data source table. We do not set other fields like SerDe lib.
For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source
table should not show unrelated stuff like Serde lib and InputFormat. I
have created https://issues.apache.org/jira/browse/SPARK-6413 to track the
improvement on the output of DESCRIBE statement.

On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai yh...@databricks.com wrote:

 Hi Christian,

 Your table is stored correctly in Parquet format.

 For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
 data source table (
 http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
 We are only using Hive's metastore to store the metadata (to be specific,
 only table properties and serde properties). When you look at table
 property, there will be a field called spark.sql.sources.provider and the
 value will be org.apache.spark.sql.parquet.DefaultSource. You can also
 look at your files in the file system. They are stored by Parquet.

 Thanks,

 Yin

 On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com
 wrote:

 Hi all,

 DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
 CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
 schema _and_ storage format in the Hive metastore, so that the table
 cannot be read from inside Hive. Spark itself can read the table, but
 Hive throws a Serialization error because it doesn't know it is
 Parquet.

  val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
  df.saveAsTable("spark_test_foo")

 Expected:

 COLUMNS(
   education BIGINT,
   income BIGINT
 )

 SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
 InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

 Actual:

 COLUMNS(
    col array<string> COMMENT 'from deserializer'
  )

  SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
 InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

 ---

 Manually changing schema and storage restores access in Hive and
 doesn't affect Spark. Note also that Hive's table property
 spark.sql.sources.schema is correct. At first glance, it looks like
 the schema data is serialized when sent to Hive but not deserialized
 properly on receive.

 I'm tracing execution through source code... but before I get any
 deeper, can anyone reproduce this behavior?

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Hi Christian,

Your table is stored correctly in Parquet format.

For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
data source table (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
We are only using Hive's metastore to store the metadata (to be specific,
only table properties and serde properties). When you look at table
property, there will be a field called spark.sql.sources.provider and the
value will be org.apache.spark.sql.parquet.DefaultSource. You can also
look at your files in the file system. They are stored by Parquet.

Thanks,

Yin

On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com
wrote:

 Hi all,

 DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
 CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
 schema _and_ storage format in the Hive metastore, so that the table
 cannot be read from inside Hive. Spark itself can read the table, but
 Hive throws a Serialization error because it doesn't know it is
 Parquet.

  val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
  df.saveAsTable("spark_test_foo")

 Expected:

 COLUMNS(
   education BIGINT,
   income BIGINT
 )

 SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
 InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

 Actual:

 COLUMNS(
    col array<string> COMMENT 'from deserializer'
  )

  SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
 InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

 ---

 Manually changing schema and storage restores access in Hive and
 doesn't affect Spark. Note also that Hive's table property
 spark.sql.sources.schema is correct. At first glance, it looks like
 the schema data is serialized when sent to Hive but not deserialized
 properly on receive.

 I'm tracing execution through source code... but before I get any
 deeper, can anyone reproduce this behavior?

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yin Huai
Was the OOM thrown during the execution of first stage (map) or the second
stage (reduce)? If it was the second stage, can you increase the value
of spark.sql.shuffle.partitions and see if the OOM disappears?

This setting controls the number of reducers Spark SQL will use, and the
default is 200. Maybe there are too many distinct values and the memory
pressure on every task (of those 200 reducers) is pretty high. You can
start with 400 and increase it until the OOM disappears. Hopefully this
will help.

Thanks,

Yin


On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:

 Hi Yin,

 Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The
 number of tasks launched is equal to the number of parquet files. Do you
 have any idea on how to deal with this situation?

 Thanks a lot
 On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote:

  Seems there are too many distinct groups processed in a task, which
  triggers the problem.

 How many files do your dataset have and how large is a file? Seems your
 query will be executed with two stages, table scan and map-side aggregation
 in the first stage and the final round of reduce-side aggregation in the
 second stage. Can you take a look at the numbers of tasks launched in these
 two stages?

 Thanks,

 Yin

 On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi there, I set the executor memory to 8g but it didn't help

 On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com wrote:

 You should probably increase executor memory by setting
 spark.executor.memory.

 Full list of available configurations can be found here
 http://spark.apache.org/docs/latest/configuration.html

 Cheng


 On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:

 Hi there,

 I was trying the new DataFrame API with some basic operations on a
 parquet dataset.
 I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a
 standalone cluster mode.
 The code is the following:

  val people = sqlContext.parquetFile("/data.parquet");
  val res = people.groupBy("name","date").agg(sum("power"),sum("supply")
 ).take(10);
 System.out.println(res);

 The dataset consists of 16 billion entries.
 The error I get is java.lang.OutOfMemoryError: GC overhead limit
 exceeded

 My configuration is:

 spark.serializer org.apache.spark.serializer.KryoSerializer
  spark.driver.memory 6g
  spark.executor.extraJavaOptions -XX:+UseCompressedOops
  spark.shuffle.manager sort

 Any idea how can I workaround this?

 Thanks a lot







Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Yin Huai
Hi Roberto,

For now, if the timestamp is a top-level column (not a field in a
struct), you can use backticks to quote the column name, like
`timestamp`.
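
For example (a sketch; the table name follows the query in this thread):

  // Backticks let the parser accept a column whose name collides with a keyword/type
  val result = sqlContext.sql("SELECT `timestamp`, column1 FROM tableM")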

Thanks,

Yin

On Wed, Mar 18, 2015 at 12:10 PM, Roberto Coluccio 
roberto.coluc...@gmail.com wrote:

 Hey Cheng, thank you so much for your suggestion, the problem was actually
 a column/field called timestamp in one of the case classes!! Once I
 changed its name everything worked out fine again. Let me say it was kinda
 frustrating ...

 Roberto

 On Wed, Mar 18, 2015 at 4:07 PM, Roberto Coluccio 
 roberto.coluc...@gmail.com wrote:

 You know, I actually have one of the columns called timestamp ! This
 may really cause the problem reported in the bug you linked, I guess.

 On Wed, Mar 18, 2015 at 3:37 PM, Cheng Lian lian.cs@gmail.com
 wrote:

  I suspect that you hit this bug
 https://issues.apache.org/jira/browse/SPARK-6250, it depends on the
 actual contents of your query.

 Yin had opened a PR for this, although not merged yet, it should be a
 valid fix https://github.com/apache/spark/pull/5078

 This fix will be included in 1.3.1.

 Cheng

 On 3/18/15 10:04 PM, Roberto Coluccio wrote:

 Hi Cheng, thanks for your reply.

  The query is something like:

  SELECT * FROM (
   SELECT m.column1, IF (d.columnA IS NOT null, d.columnA, m.column2),
 ..., m.columnN FROM tableD d RIGHT OUTER JOIN tableM m on m.column2 =
  d.columnA WHERE m.column2!=\"None\" AND d.columnA!=\"\"
   UNION ALL
   SELECT ... [another SELECT statement with different conditions but
 same tables]
   UNION ALL
   SELECT ... [another SELECT statement with different conditions but
 same tables]
 ) a


  I'm using just sqlContext, no hiveContext. Please, note once again
 that this perfectly worked w/ Spark 1.1.x.

  The tables, i.e. tableD and tableM are previously registered with the
 RDD.registerTempTable method, where the input RDDs are actually a
 RDD[MyCaseClassM/D], with MyCaseClassM and MyCaseClassD being simple
 case classes with only (and less than 22) String fields.

  Hope the situation is a bit more clear. Thanks anyone who will help me
 out here.

  Roberto



 On Wed, Mar 18, 2015 at 12:09 PM, Cheng Lian lian.cs@gmail.com
 wrote:

  Would you mind to provide the query? If it's confidential, could you
 please help constructing a query that reproduces this issue?

 Cheng

 On 3/18/15 6:03 PM, Roberto Coluccio wrote:

 Hi everybody,

  When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both
 1.2.0 and 1.2.1) I encounter a weird error never occurred before about
 which I'd kindly ask for any possible help.

   In particular, all my Spark SQL queries fail with the following
 exception:

  java.lang.RuntimeException: [1.218] failure: identifier expected

 [my query listed]
   ^
   at scala.sys.package$.error(package.scala:27)
   at
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
   at
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
   at
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
   at
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
   at
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
   at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
   at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
   at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
   at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
   ...



  The unit tests I've got for testing this stuff fail both if I
 build+test the project with Maven and if I run then as single ScalaTest
 files or test suites/packages.

  When running my app as usual on EMR in YARN-cluster mode, I get the
 following:

  15/03/17 11:32:14 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 15, (reason: User class threw exception: [1.218] failure: 
 identifier expected

 SELECT * FROM ... (my query)


^)
 Exception in thread Driver java.lang.RuntimeException: [1.218] failure: 
 identifier expected

 SELECT * FROM ... (my query)   


^
 at scala.sys.package$.error(package.scala:27)
 at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
 at 
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
 at 
 

Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
Initially, an attribute reference (column reference), like selecting a
column from a table, is not resolved since we do not know if the reference
is valid or not (if this column exists in the underlying table). In the
query compilation process, we will first analyze this query and resolve
those attribute references. A resolved attribute reference means that this
reference is valid and we know where to get the column values from the
input. Hope this is helpful.
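
As a minimal, hypothetical sketch of what "resolved" means in practice (table and column names are made up):

  case class Person(name: String, age: Int)
  import sqlContext.createSchemaRDD
  sc.parallelize(Seq(Person("a", 30))).registerTempTable("people")
  sqlContext.sql("SELECT name FROM people").collect()    // name resolves, shown internally as something like name#0
  sqlContext.sql("SELECT salary FROM people").collect()  // 'salary has no id and stays unresolved, so the analyzer reports it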

On Tue, Mar 17, 2015 at 2:19 PM, Ophir Cohen oph...@gmail.com wrote:

 Thanks you for the answer and one more question: what does it mean
 'resolved attribute'?
 On Mar 17, 2015 8:14 PM, Yin Huai yh...@databricks.com wrote:

 The number is an id we use internally to identify a resolved Attribute.
 Looks like basic_null_diluted_d was not resolved since there is no id
 associated with it.

 On Tue, Mar 17, 2015 at 2:08 PM, Ophir Cohen oph...@gmail.com wrote:

 Interesting, I thought the problem is with the method itself.
 I will check it soon and update.
 Can you elaborate what does it mean the # and the number? Is that a
 reference to the field in the rdd?
 Thank you,
 Ophir
 On Mar 17, 2015 7:06 PM, Yin Huai yh...@databricks.com wrote:

 Seems basic_null_diluted_d was not resolved? Can you check if
 basic_null_diluted_d is in your table?

 On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen oph...@gmail.com wrote:

 Hi Guys,
 I'm registering a function using:
 sqlc.registerFunction(makeEstEntry,ReutersDataFunctions.makeEstEntry
 _)

 Then I register the table and try to query the table using that
 function and I get:
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
 Unresolved attributes:
 'makeEstEntry(numest#20,median#21,mean#22,stddev#23,high#24,low#25,currency_#26,units#27,'basic_null_diluted_d)
 AS FY0#2837, tree:

 Thanks!
 Ophir






Re: Date and decimal datatype not working

2015-03-17 Thread Yin Huai
p(0) is a String. So, you need to explicitly convert it to a Long. e.g.
p(0).trim.toLong. You also need to do it for p(2). For the BigDecimal
values, you need to create BigDecimal objects from your String values.
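
Applied to the snippet quoted below, the map step would then look roughly like this (a sketch; note the escaped pipe in split, since an unescaped | is treated as a regex alternation):

  val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt")
    .map(_.split("\\|"))
    .map(p => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3),
      BigDecimal(p(4).trim), BigDecimal(p(5).trim), BigDecimal(p(6).trim)))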

On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA ab9...@att.com wrote:

  Hi All,

 I am very new in Spark world. Just started some test coding from last
 week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding.

 I am having issues while using Date and decimal data types. Following is
 my code that I am simply running on scala prompt. I am trying to define a
 table and point that to my flat file containing raw data (pipe delimited
 format). Once that is done, I will run some SQL queries and put the output
 data in to another flat file with pipe delimited format.



 ***

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 import sqlContext.createSchemaRDD





 // Define row and table

 case class ROW_A(

   TSTAMP:   Long,

   USIDAN: String,

   SECNT:Int,

   SECT:   String,

   BLOCK_NUM:BigDecimal,

   BLOCK_DEN:BigDecimal,

   BLOCK_PCT:BigDecimal)



 val TABLE_A =
 sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
 => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))



 TABLE_A.registerTempTable("TABLE_A")



 ***



 The second last command is giving error, like following:

 <console>:17: error: type mismatch;

 found   : String

 required: Long



 Looks like the content from my flat file are considered as String always
 and not as Date or decimal. How can I make Spark to take them as Date or
 decimal types?



 Regards

 Ananda



Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
The number is an id we use internally to identify a resolved Attribute.
Looks like basic_null_diluted_d was not resolved since there is no id
associated with it.

On Tue, Mar 17, 2015 at 2:08 PM, Ophir Cohen oph...@gmail.com wrote:

 Interesting, I thought the problem is with the method itself.
 I will check it soon and update.
 Can you elaborate what does it mean the # and the number? Is that a
 reference to the field in the rdd?
 Thank you,
 Ophir
 On Mar 17, 2015 7:06 PM, Yin Huai yh...@databricks.com wrote:

 Seems basic_null_diluted_d was not resolved? Can you check if
 basic_null_diluted_d is in your table?

 On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen oph...@gmail.com wrote:

 Hi Guys,
 I'm registering a function using:
 sqlc.registerFunction(makeEstEntry,ReutersDataFunctions.makeEstEntry _)

 Then I register the table and try to query the table using that function
 and I get:
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
 Unresolved attributes:
 'makeEstEntry(numest#20,median#21,mean#22,stddev#23,high#24,low#25,currency_#26,units#27,'basic_null_diluted_d)
 AS FY0#2837, tree:

 Thanks!
 Ophir





Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
Seems basic_null_diluted_d was not resolved? Can you check if
basic_null_diluted_d is in your table?

On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen oph...@gmail.com wrote:

 Hi Guys,
 I'm registering a function using:
 sqlc.registerFunction(makeEstEntry,ReutersDataFunctions.makeEstEntry _)

 Then I register the table and try to query the table using that function
 and I get:
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes:
 'makeEstEntry(numest#20,median#21,mean#22,stddev#23,high#24,low#25,currency_#26,units#27,'basic_null_diluted_d)
 AS FY0#2837, tree:

 Thanks!
 Ophir



Re: Loading in json with spark sql

2015-03-13 Thread Yin Huai
Seems you want to use an array for the "providers" field, like
"providers": [{"id": ...}, {"id": ...}] instead of
"providers": {{"id": ...}, {"id": ...}}
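
For example, each record would sit on a single line shaped like this (values shortened from the original post, quoting shown explicitly):

  {"user_id":7070,"providers":[{"id":8753,"name":"pjfig","behaviors":{"b1":"erwxt","b2":"yjooj"}},{"id":8329,"name":"dfvhh","behaviors":{"b1":"pjjdn","b2":"ooqsh"}}]}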

On Fri, Mar 13, 2015 at 7:45 PM, kpeng1 kpe...@gmail.com wrote:

 Hi All,

 I was noodling around with loading in a json file into spark sql's hive
 context and I noticed that I get the following message after loading in the
 json file:
 PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47

 I am using the HiveContext to load in the json file using the jsonFile
 command.  I also have 1 json object per line on the file.  Here is a sample
 of the contents in the json file:

 {user_id:7070,providers:{{id:8753,name:pjfig,behaviors:{b1:erwxt,b2:yjooj}},{id:8329,name:dfvhh,behaviors:{b1:pjjdn,b2:ooqsh

 {user_id:1615,providers:{{id:6105,name:rsfon,behaviors:{b1:whlje,b2:lpjnq}},{id:6828,name:pnmrb,behaviors:{b1:fjpmz,b2:dxqxk

 {user_id:5210,providers:{{id:9360,name:xdylm,behaviors:{b1:gcdze,b2:cndcs}},{id:4812,name:gxboh,behaviors:{b1:qsxao,b2:ixdzq




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Loading-in-json-with-spark-sql-tp22044.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark SQL. Cast to Bigint

2015-03-13 Thread Yin Huai
Are you using SQLContext? Right now, the parser in the SQLContext is quite
limited on the data type keywords that it handles (see here
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala#L391)
and unfortunately bigint is not handled in it right now. We will add
other data types in there (https://issues.apache.org/jira/browse/SPARK-6146
is used to track it).  Can you try HiveContext for now?
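
A minimal sketch of that workaround (it requires a Hive-enabled build, since HiveContext lives in the spark-hive module; table and column names are placeholders):

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  hiveContext.sql("SELECT CAST(some_column AS BIGINT) FROM some_table")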

On Fri, Mar 13, 2015 at 4:48 AM, Masf masfwo...@gmail.com wrote:

 Hi.

 I have a query in Spark SQL and I can not covert a value to BIGINT:
 CAST(column AS BIGINT) or
 CAST(0 AS BIGINT)

 The output is:
 Exception in thread main java.lang.RuntimeException: [34.62] failure:
 ``DECIMAL'' expected but identifier BIGINT found

 Thanks!!
 Regards.
 Miguel Ángel



Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Yin Huai
Hi Nitay,

Can you try using backticks to quote the column name? Like
org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType(
"struct<`int`:bigint>")?

Thanks,

Yin

On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust mich...@databricks.com
wrote:

 Thanks for reporting.  This was a result of a change to our DDL parser
 that resulted in types becoming reserved words.  I've filled a JIRA and
 will investigate if this is something we can fix.
 https://issues.apache.org/jira/browse/SPARK-6250

 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote:

 In Spark 1.2 I used to be able to do this:

 scala>
 org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>")
 res30: org.apache.spark.sql.catalyst.types.DataType =
 StructType(List(StructField(int,LongType,true)))

 That is, the name of a column can be a keyword like int. This is no
 longer the case in 1.3:

 data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>")
 org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8]
 failure: ``'' expected but `int' found

 struct<int:bigint>
^
 at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
 at
 org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785)
 at
 org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9)

 Note HiveTypeHelper is simply an object I load in to expose
 HiveMetastoreTypes since it was made private. See
 https://gist.github.com/nitay/460b41ed5fd7608507f5

 This is actually a pretty big problem for us as we have a bunch of legacy
 tables with column names like timestamp. They work fine in 1.2, but now
 everything throws in 1.3.

 Any thoughts?

 Thanks,
 - Nitay
 Founder  CTO





Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon,

No, you do not need to install Hive to use Spark SQL.

Thanks,

Yin
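
A minimal sketch with no Hive installation at all, using only the plain SQLContext (names are illustrative):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD
  case class Record(key: Int, value: String)
  sc.parallelize(Seq(Record(1, "a"))).registerTempTable("records")
  sqlContext.sql("SELECT key, value FROM records").collect()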

On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon



Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1
(versions that we support). Seems
https://issues.apache.org/jira/browse/HIVE-5472 added it to Hive recently.
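
As a possible stop-gap (my suggestion, not something discussed in the thread), the current date can usually be derived from UDFs that are present in Hive 0.12.0/0.13.1, assuming a HiveContext and an existing table to select from:

  hiveContext.sql("SELECT to_date(from_unixtime(unix_timestamp())) FROM some_table LIMIT 1")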

On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote:

  The temp table in metastore can not be shared cross SQLContext
 instances, since HiveContext is a sub class of SQLContext (inherits all of
 its functionality), why not using a single HiveContext globally? Is there
 any specific requirement in your case that you need multiple
 SQLContext/HiveContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 9:46 PM

 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



 You are right ,  CassandraAwareSQLContext is subclass of SQL context.



 But I did another experiment, I queried Cassandra
 using CassandraAwareSQLContext, then I registered the rdd as a temp table
 , next I tried to query it using HiveContext, but it seems that hive
 context can not see the registered table using SQL context. Is this a
 normal case?



 best,

 /Shahab





 On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Hive UDF are only applicable for HiveContext and its subclass instance,
 is the CassandraAwareSQLContext a direct sub class of HiveContext or
 SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



   val sc: SparkContext = new SparkContext(conf)

   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some
 Calliope Cassandra Spark connector

 val rdd : SchemaRDD  = sqlCassContext.sql(select * from db.profile  )

 rdd.cache

 rdd.registerTempTable(profile)

  rdd.first  //enforce caching

  val q = select  from_unixtime(floor(createdAt/1000)) from profile
 where sampling_bucket=0 

  val rdd2 = rdd.sqlContext.sql(q )

  println (Result:  + rdd2.first)



 And I get the following  errors:

 Exception in thread main
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling, profile,
 org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None, None,
 false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

 at
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

 at
 

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
@Shahab, based on https://issues.apache.org/jira/browse/HIVE-5472,
current_date was added in Hive *1.2.0 (not 0.12.0)*. For my previous email,
I meant current_date is in neither Hive 0.12.0 nor Hive 0.13.1 (Spark
SQL currently supports these two Hive versions).

On Tue, Mar 3, 2015 at 8:55 AM, Rohit Rai ro...@tuplejump.com wrote:

 The Hive dependency comes from spark-hive.

 It does work with Spark 1.1 we will have the 1.2 release later this month.
 On Mar 3, 2015 8:49 AM, shahab shahab.mok...@gmail.com wrote:


 Thanks Rohit,

 I am already using Calliope and quite happy with it, well done ! except
 the fact that :
 1- It seems that it does not support Hive 0.12 or higher, Am i right?
  for example you can not use : current_time() UDF, or those new UDFs added
 in hive 0.12 . Are they supported? Any plan for supporting them?
 2-It does not support Spark 1.1 and 1.2. Any plan for new release?

 best,
 /Shahab

 On Tue, Mar 3, 2015 at 5:41 PM, Rohit Rai ro...@tuplejump.com wrote:

 Hello Shahab,

 I think CassandraAwareHiveContext
 https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/org/apache/spark/sql/hive/CassandraAwareHiveContext.scala
  in
 Calliopee is what you are looking for. Create CAHC instance and you should
 be able to run hive functions against the SchemaRDD you create from there.

 Cheers,
 Rohit

 *Founder  CEO, **Tuplejump, Inc.*
 
 www.tuplejump.com
 *The Data Engineering Platform*

 On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote:

  The temp table in metastore can not be shared cross SQLContext
 instances, since HiveContext is a sub class of SQLContext (inherits all of
 its functionality), why not using a single HiveContext globally? Is there
 any specific requirement in your case that you need multiple
 SQLContext/HiveContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 9:46 PM

 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



 You are right ,  CassandraAwareSQLContext is subclass of SQL context.



 But I did another experiment, I queried Cassandra
 using CassandraAwareSQLContext, then I registered the rdd as a temp table
 , next I tried to query it using HiveContext, but it seems that hive
 context can not see the registered table using SQL context. Is this a
 normal case?



 best,

 /Shahab





 On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Hive UDF are only applicable for HiveContext and its subclass
 instance, is the CassandraAwareSQLContext a direct sub class of
 HiveContext or SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



   val sc: SparkContext = new SparkContext(conf)

   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some
 Calliope Cassandra Spark connector

 val rdd : SchemaRDD  = sqlCassContext.sql(select * from db.profile  )

 rdd.cache

 rdd.registerTempTable(profile)

  rdd.first  //enforce caching

  val q = select  from_unixtime(floor(createdAt/1000)) from profile
 where sampling_bucket=0 

  val rdd2 = rdd.sqlContext.sql(q )

  println (Result:  + rdd2.first)



 And I get the following  errors:

 Exception in thread main
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling,
 profile, org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None,
 None, false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

 at 

Re: Issues reading in Json file with spark sql

2015-03-02 Thread Yin Huai
Is the string of the above JSON object all on a single line? jsonFile requires
that every line is a complete JSON object or an array of JSON objects.
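
A minimal sketch of the expected input shape (the whole object on one line; field names taken from the post below, content heavily trimmed):

  val oneLine = """{"namespace":"spacey","name":"namer","type":"record","fields":[{"name":"f1","type":["null","string"]}]}"""
  val j = sqlsc.jsonRDD(sc.parallelize(Seq(oneLine)))
  j.printSchema()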

On Mon, Mar 2, 2015 at 11:28 AM, kpeng1 kpe...@gmail.com wrote:

 Hi All,

 I am currently having issues reading in a json file using spark sql's api.
 Here is what the json file looks like:
 {
   namespace: spacey,
   name: namer,
   type: record,
   fields: [
 {name:f1,type:[null,string]},
 {name:f2,type:[null,string]},
 {name:f3,type:[null,string]},
 {name:f4,type:[null,string]},
 {name:f5,type:[null,string]},
 {name:f6,type:[null,string]},
 {name:f7,type:[null,string]},
 {name:f8,type:[null,string]},
 {name:f9,type:[null,string]},
 {name:f10,type:[null,string]},
 {name:f11,type:[null,string]},
 {name:f12,type:[null,string]},
 {name:f13,type:[null,string]},
 {name:f14,type:[null,string]},
 {name:f15,type:[null,string]}
   ]
 }

 This is what I am doing to read in the json file(using spark sql in the
 spark shell on CDH5.3):

 val sqlsc = new org.apache.spark.sql.SQLContext(sc)
 val j = sqlsc.jsonFile("/tmp/try.avsc")


 This is what I am getting as an error:

 15/03/02 11:23:45 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 12,
 10.0.2.15): scala.MatchError: namespace (of class java.lang.String)
 at

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:305)
 at

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:303)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at

 scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
 at
 scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
 at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:853)
 at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:851)
 at
 org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1350)
 at
 org.apache.spark.SparkContext$$anonfun$29.apply(SparkContext.scala:1350)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 15/03/02 11:23:45 INFO TaskSetManager: Starting task 0.1 in stage 3.0 (TID
 14, 10.0.2.15, ANY, 1308 bytes)
 15/03/02 11:23:45 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
 13) in 128 ms on 10.0.2.15 (1/2)
 15/03/02 11:23:45 INFO TaskSetManager: Lost task 0.1 in stage 3.0 (TID 14)
 on executor 10.0.2.15: scala.MatchError (namespace (of class
 java.lang.String)) [duplicate 1]
 15/03/02 11:23:45 INFO TaskSetManager: Starting task 0.2 in stage 3.0 (TID
 15, 10.0.2.15, ANY, 1308 bytes)
 15/03/02 11:23:45 INFO TaskSetManager: Lost task 0.2 in stage 3.0 (TID 15)
 on executor 10.0.2.15: scala.MatchError (namespace (of class
 java.lang.String)) [duplicate 2]
 15/03/02 11:23:45 INFO TaskSetManager: Starting task 0.3 in stage 3.0 (TID
 16, 10.0.2.15, ANY, 1308 bytes)
 15/03/02 11:23:45 INFO TaskSetManager: Lost task 0.3 in stage 3.0 (TID 16)
 on executor 10.0.2.15: scala.MatchError (namespace (of class
 java.lang.String)) [duplicate 3]
 15/03/02 11:23:45 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times;
 aborting job
 15/03/02 11:23:45 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks
 have all completed, from pool
 15/03/02 11:23:45 INFO TaskSchedulerImpl: Cancelling stage 3
 15/03/02 11:23:45 INFO DAGScheduler: Job 3 failed: reduce at
 JsonRDD.scala:57, took 0.210707 s
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in
 stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0
 (TID 16, 10.0.2.15): scala.MatchError: namespace (of class
 java.lang.String)
 at

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:305)
 at

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:303)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at

 scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
 at
 

Re: [SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4296. It has been
fixed in 1.2.1 and 1.3.

On Thu, Feb 26, 2015 at 1:22 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Can someone confirm if they can run UDFs in group by in spark1.2?

 I have two builds running -- one from a custom build from early December
 (commit 4259ca8dd12) which works fine, and Spark1.2-RC2.

 On the latter I get:

  jdbc:hive2://XXX.208:10001 select 
 from_unixtime(epoch,'-MM-dd-HH'),count(*) count
 . . . . . . . . . . . . . . . . . . from tbl
 . . . . . . . . . . . . . . . . . . group by 
 from_unixtime(epoch,'-MM-dd-HH');
 Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
 Expression not in GROUP BY: 
 HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,-MM-dd-HH)
  AS _c0#1004, tree:
 Aggregate 
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,-MM-dd-HH)],
  
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,-MM-dd-HH)
  AS _c0#1004,COUNT(1) AS count#1003L]
  MetastoreRelation default, tbl, None (state=,code=0)

 ​

 This worked fine on my older build. I don't see a JIRA on this but maybe
 I'm not looking right. Can someone please advise?






Re: SQLContext.applySchema strictness

2015-02-13 Thread Yin Huai
Hi Justin,

It is expected. We do not eagerly check whether the provided schema matches the
rows, because doing so would require scanning every row up front.

Thanks,

Yin
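
If you want mismatches to surface eagerly, a small sketch of a do-it-yourself check (reusing the names from the snippet below) is to force a pass over the data right after applySchema:

  val schemaData = sqlContext.applySchema(myData, struct)
  schemaData.foreach(row => row.getBoolean(0))  // fails at runtime on the first row whose column 0 is not a Boolean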

On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony justin.pih...@gmail.com
wrote:

 Per the documentation:

   It is important to make sure that the structure of every Row of the
 provided RDD matches the provided schema. Otherwise, there will be runtime
 exception.

 However, it appears that this is not being enforced.

 import org.apache.spark.sql._
 val sqlContext = new SQLContext(sc)
 val struct = StructType(List(StructField("test", BooleanType, true)))
 val myData = sc.parallelize(List(Row(0), Row(true), Row("stuff")))
 val schemaData = sqlContext.applySchema(myData, struct) //No error
 schemaData.collect()(0).getBoolean(0) //Only now will I receive an error

 Is this expected or a bug?

 Thanks,
 Justin



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-applySchema-strictness-tp21650.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Yin Huai
Regarding backticks: Right. You need backticks to quote the column name
timestamp because timestamp is a reserved keyword in our parser.

On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani mohnish.kodn...@gmail.com
wrote:

 actually i tried in spark shell , got same error and then for some reason
 i tried to back tick the timestamp and it worked.
  val result = sqlContext.sql(select toSeconds(`timestamp`) as t,
 count(rid) as qps from blah group by toSeconds(`timestamp`),qi.clientName)

 so, it seems sql context is supporting UDF.



 On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com
 wrote:

 The simple SQL parser doesn't yet support UDFs.  Try using a HiveContext.

 On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani 
 mohnish.kodn...@gmail.com wrote:

 Hi,
 I am trying a very simple registerFunction and it is giving me errors.

 I have a parquet file which I register as temp table.
 Then I define a UDF.

 def toSeconds(timestamp: Long): Long = timestamp/10

 sqlContext.registerFunction(toSeconds, toSeconds _)

 val result = sqlContext.sql(select toSeconds(timestamp) from blah);
 I get the following error.
 java.lang.RuntimeException: [1.18] failure: ``)'' expected but
 `timestamp' found

 select toSeconds(timestamp) from blah

 My end goal is as follows:
 We have log file with timestamps in microseconds and I would like to
 group by entries with second level precision, so eventually I want to run
 the query
 select toSeconds(timestamp) as t, count(x) from table group by t,x








Re: Can we execute create table and load data commands against Hive inside HiveContext?

2015-02-10 Thread Yin Huai
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
was introduced in Hive 0.14 and Spark SQL only supports Hive 0.12 and
0.13.1. Can you change the setting of hive.security.authorization.manager
to something accepted by Hive 0.12 or 0.13.1?
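
One hedged way to do that from the Spark side, assuming HiveContext.setConf propagates to the underlying Hive configuration and that the pre-0.14 default provider is acceptable for your setup (editing hive-site.xml on the driver is the more direct route):

  val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
  hiveCtx.setConf("hive.security.authorization.manager",
    "org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider")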

On Thu, Feb 5, 2015 at 11:40 PM, guxiaobo1982 guxiaobo1...@qq.com wrote:

 Hi,

 I am playing with the following example code:

 public class SparkTest {

 public static void main(String[] args){

  String appName = "This is a test application";

  String master = "spark://lix1.bh.com:7077";

   SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

  JavaSparkContext sc = new JavaSparkContext(conf);

   JavaHiveContext sqlCtx = new
 org.apache.spark.sql.hive.api.java.JavaHiveContext(sc);

  //sqlCtx.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");

 //sqlCtx.sql("LOAD DATA LOCAL INPATH '/opt/spark/examples/src/main/resources/kv1.txt' INTO TABLE src");

  // Queries are expressed in HiveQL.

  List<Row> rows = sqlCtx.sql("FROM src SELECT key, value").collect();

  //List<Row> rows = sqlCtx.sql("show tables").collect();

   System.out.print("I got " + rows.size() + " rows \r\n");

  sc.close();

 }}

 With the create table and load data commands commented out, the query
 command can be executed successfully, but I come to ClassNotFoundExceptions
 if these two commands are executed inside HiveContext, even with different
 error messages,

 The create table command will cause the following:




 Exception in thread main
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Hive
 Internal Error:
 java.lang.ClassNotFoundException(org.apache.hadoop.hive.ql.hooks.ATSHook)

 at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)

 at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)

 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(
 NativeCommand.scala:35)

 at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(
 NativeCommand.scala:35)

 at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)

 at org.apache.spark.sql.hive.execution.NativeCommand.execute(
 NativeCommand.scala:30)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(
 SQLContext.scala:425)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(
 SQLContext.scala:425)

 at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)

 at org.apache.spark.sql.api.java.JavaSchemaRDD.init(
 JavaSchemaRDD.scala:42)

 at org.apache.spark.sql.hive.api.java.JavaHiveContext.sql(
 JavaHiveContext.scala:37)

 at com.blackhorse.SparkTest.main(SparkTest.java:24)

 [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook
 called

 [delete Spark local dirs] DEBUG org.apache.spark.storage.DiskBlockManager
 - Shutdown hook called

 The load data command will cause the following:



 Exception in thread main
 org.apache.spark.sql.execution.QueryExecutionException: FAILED:
 RuntimeException org.apache.hadoop.hive.ql.metadata.HiveException:
 java.lang.ClassNotFoundException:
 org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory

 at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)

 at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)

 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)

 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)

 at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)

 at
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)

 at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)

 at
 org.apache.spark.sql.api.java.JavaSchemaRDD.init(JavaSchemaRDD.scala:42)

 at
 org.apache.spark.sql.hive.api.java.JavaHiveContext.sql(JavaHiveContext.scala:37)

 at com.blackhorse.SparkTest.main(SparkTest.java:25)

 [delete Spark local dirs] DEBUG org.apache.spark.storage.DiskBlockManager
 - Shutdown hook called

 [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook
 called





Re: Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread Yin Huai
Can you try using backticks to quote the field name? Like `f:price`.
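
Applied to the query from the post below, that would look like this (untested against that exact schema):

  val productsRdd = sqlContext.sql(
    "SELECT Locales.Invariant.Metadata.item.`f:price` FROM products LIMIT 10")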

On Tue, Feb 10, 2015 at 5:47 AM, presence2001 neil.andra...@thefilter.com
wrote:

 Hi list,

 I have some data with a field name of f:price (it's actually part of a JSON
 structure loaded from ElasticSearch via elasticsearch-hadoop connector, but
 I don't think that's significant here). I'm struggling to figure out how to
 express that in a Spark SQL SELECT statement without generating an error
 (and haven't been able to find any similar examples in the documentation).

 val productsRdd = sqlContext.sql(SELECT
 Locales.Invariant.Metadata.item.f:price FROM products LIMIT 10)

 gives me the following error...

 java.lang.RuntimeException: [1.41] failure: ``UNION'' expected but `:'
 found

 Changing the column name is one option, but I have other systems depending
 on this right now so it's not a trivial exercise. :(

 I'm using Spark 1.2.

 Thanks in advance for any advice / help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Column-name-including-a-colon-in-a-SELECT-clause-tp21576.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Yin Huai
Hello Michael,

In Spark SQL, we have our internal concepts of Output Partitioning
(representing the partitioning scheme of an operator's output) and Required
Child Distribution (representing the requirement of input data distribution
of an operator) for a physical operator. Let's say we have two operators,
parent and child, and the parent takes the output of the child as its
input. At the end of query planning process, whenever the Output
Partitioning of the child does not satisfy the Required Child Distribution
of the parent, we will add an Exchange operator between the parent and
child to shuffle the data. Right now, we do not record the partitioning
scheme of an input table. So, I think even if you use partitionBy (or
DISTRIBUTE BY in SQL) to prepare your data, you still will see the Exchange
operator and your GROUP BY operation will be executed in a new stage (after
the Exchange).

Making Spark SQL aware of the partitioning scheme of input tables is a
useful optimization. I just created
https://issues.apache.org/jira/browse/SPARK-5354 to track it.

Thanks,

Yin
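
A small sketch for checking whether an Exchange ends up in the physical plan of a given query (table and column names taken from the example quoted below):

  val grouped = sqlContext.sql(
    "SELECT CustomerCode, SUM(Cost) FROM Orders GROUP BY CustomerCode")
  println(grouped.queryExecution.executedPlan)  // an Exchange operator here means a shuffle will happen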



On Wed, Jan 21, 2015 at 8:43 AM, Michael Davies 
michael.belldav...@gmail.com wrote:

 Hi Cheng,

 Are you saying that by setting up the lineage schemaRdd.keyBy(_.getString(
 1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)
 then Spark SQL will know that an SQL “group by” on Customer Code will not
 have to shuffle?

 But the prepared will have already shuffled so we pay an upfront cost for
 future groupings (assuming we cache I suppose)

 Mick

 On 20 Jan 2015, at 20:44, Cheng Lian lian.cs@gmail.com wrote:

  First of all, even if the underlying dataset is partitioned as expected,
 a shuffle can’t be avoided. Because Spark SQL knows nothing about the
 underlying data distribution. However, this does reduce network IO.

 You can prepare your data like this (say CustomerCode is a string field
 with ordinal 1):

 val schemaRdd = sql(...)
 val schema = schemaRdd.schema
 val prepared = schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)

 n should be equal to spark.sql.shuffle.partitions.

 Cheng

 On 1/19/15 7:44 AM, Mick Davies wrote:


  Is it possible to use a HashPartioner or something similar to distribute a
 SchemaRDDs data by the hash of a particular column or set of columns.

 Having done this I would then hope that GROUP BY could avoid shuffle

 E.g. set up a HashPartioner on CustomerCode field so that

 SELECT CustomerCode, SUM(Cost)
 FROM Orders
 GROUP BY CustomerCode

 would not need to shuffle.

 Cheers
 Mick





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Using-HashPartitioner-to-distribute-by-column-tp21237.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



 ​





Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Yin Huai
Yeah, it's a bug. It has been fixed by
https://issues.apache.org/jira/browse/SPARK-3891 in master.

On Tue, Jan 13, 2015 at 2:41 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looking at the source code for AbstractGenericUDAFResolver, the following
 (non-deprecated) method should be called:

   public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info)

 It is called by hiveUdfs.scala (master branch):

 val parameterInfo = new
 SimpleGenericUDAFParameterInfo(inspectors.toArray, false, false)
 resolver.getEvaluator(parameterInfo)

 FYI

 On Tue, Jan 13, 2015 at 1:51 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 The following SQL query

 select percentile_approx(variables.var1, 0.95) p95
 from model

 will throw

 ERROR SparkSqlInterpreter: Error
 org.apache.hadoop.hive.ql.parse.SemanticException: This UDAF does not
 support the deprecated getEvaluator() method.
 at
 org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:53)
 at
 org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector$lzycompute(hiveUdfs.scala:196)
 at
 org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector(hiveUdfs.scala:195)
 at
 org.apache.spark.sql.hive.HiveGenericUdaf.dataType(hiveUdfs.scala:203)
 at
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:105)
 at
 org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143)
 at
 org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143)

 I'm using latest branch-1.2

 I found in PR that percentile and percentile_approx are supported. A bug?

 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Yin Huai
Hello Jianshi,

You meant you want to convert a Map to a Struct, right? We can extract some
useful functions from JsonRDD.scala, so others can access them.

Thanks,

Yin
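
In the meantime, when the keys and value types are known up front, a minimal sketch of a manual conversion (my own workaround; mapRdd stands for your RDD[Map[String, Any]], and the field names are illustrative) is to build Rows plus an explicit schema and call applySchema:

  import org.apache.spark.sql._
  val schema = StructType(Seq(
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)))
  val rowRdd = mapRdd.map(m => Row(m.getOrElse("name", null), m.getOrElse("age", null)))
  val schemaRdd = sqlContext.applySchema(rowRdd, schema)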

On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 I checked the source code for inferSchema. Looks like this is exactly what
 I want:

   val allKeys = rdd.map(allKeysWithValueTypes).reduce(_ ++ _)

 Then I can do createSchema(allKeys).

 Jianshi

 On Sun, Dec 7, 2014 at 2:50 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hmm..

 I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782

 Jianshi

 On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD?

 I'm currently converting each Map to a JSON String and do
 JsonRDD.inferSchema.

 How about adding inferSchema support to Map[String, Any] directly? It
 would be very useful.

 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Yin Huai
Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive
table. Hive does not have the notion of containsNull for array values.
So, for a Hive table, containsNull will always be true for an array and
we should ignore this field for Hive. This issue has been fixed by
https://issues.apache.org/jira/browse/SPARK-4245, which will be released
with 1.2.

Thanks,

Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan jonat...@amazon.com
wrote:

 After playing around with this a little more, I discovered that:

 1. If test.json contains something like {values:[null,1,2,3]}, the
 schema auto-determined by SchemaRDD.jsonFile() will have element: integer
 (containsNull = true), and then
 SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
 makes sense but doesn't really help).
 2. If I specify the schema myself (e.g., sqlContext.jsonFile(test.json,
 StructType(Seq(StructField(values, ArrayType(IntegerType, true),
 true), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
 work, though as I mentioned before, this is less than ideal.

 Why don't saveAsTable/insertInto work when the containsNull properties
 don't match?  I can understand how inserting data with containsNull=true
 into a column where containsNull=false might fail, but I think the other
 way around (which is the case here) should work.

 ~ Jonathan


 On 11/26/14, 5:23 PM, Kelly, Jonathan jonat...@amazon.com wrote:

 I've noticed some strange behavior when I try to use
 SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
 that contains elements with nested arrays.  For example, with a file
 test.json that contains the single line:
 
{values:[1,2,3]}
 
 and with code like the following:
 
 scala val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 scala val test = sqlContext.jsonFile(test.json)
 scala test.saveAsTable(test)
 
 it creates the table but fails when inserting the data into it.  Here's
 the exception:
 
 scala.MatchError: ArrayType(IntegerType,true) (of class
 org.apache.spark.sql.catalyst.types.ArrayType)
at
 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:
 2
 47)
at
 org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
at
 org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
at
 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scal
 a
 :84)
at
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.app
 l
 y(Projection.scala:66)
at
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.app
 l
 y(Projection.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
 $apache$spark$s
 q
 l$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.sc
 a
 la:149)
at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiv
 e
 File$1.apply(InsertIntoHiveTable.scala:158)
at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiv
 e
 File$1.apply(InsertIntoHiveTable.scala:158)
at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:
 1
 145)
at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java
 :
 615)
at java.lang.Thread.run(Thread.java:745)
 
 I'm guessing that this is due to the slight difference in the schemas of
 these tables:
 
 scala test.printSchema
 root
  |-- values: array (nullable = true)
  ||-- element: integer (containsNull = false)
 
 
 scala sqlContext.table(test).printSchema
 root
  |-- values: array (nullable = true)
  ||-- element: integer (containsNull = true)
 
 If I reload the file using the schema that was created for the Hive table
 then try inserting the data into the table, it works:
 
 scala sqlContext.jsonFile(file:///home/hadoop/test.json,
 sqlContext.table(test).schema).insertInto(test)
 scala sqlContext.sql(select * from test).collect().foreach(println)
 [ArrayBuffer(1, 2, 3)]
 
 Does this mean that there is a bug with how the schema is being
 automatically determined when you use HiveContext.jsonFile() for JSON
 files that contain nested arrays?  (i.e., should containsNull be true for
 the array elements?)  Or is there a bug with how the Hive table is created
 from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
 probably get around this by defining the schema myself rather than using
 auto-detection, but for now I'd like to use auto-detection.
 
 By the way, I'm using Spark 1.1.0.
 
 Thanks,
 

Re: can't get smallint field from hive on spark

2014-11-26 Thread Yin Huai
For hive on spark, did you mean the thrift server of Spark SQL or
https://issues.apache.org/jira/browse/HIVE-7292? If you meant the latter
one, I think Hive's mailing list will be a good place to ask (see
https://hive.apache.org/mailing_lists.html).

Thanks,

Yin

On Wed, Nov 26, 2014 at 10:49 PM, 诺铁 noty...@gmail.com wrote:

 thank you very much.

 On Thu, Nov 27, 2014 at 11:30 AM, Michael Armbrust mich...@databricks.com
  wrote:

 This has been fixed in Spark 1.1.1 and Spark 1.2

 https://issues.apache.org/jira/browse/SPARK-3704

 On Wed, Nov 26, 2014 at 7:10 PM, 诺铁 noty...@gmail.com wrote:

 hi,

 don't know whether this question should be asked here, if not, please
 point me out, thanks.

 we are currently using hive on spark, when reading a small int field, it
 reports error:
 Cannot get field 'i16Val' because union is currently set to i32Val

 I googled and find only source code of
 TColumnValue.java
 https://svn.apache.org/repos/asf/hive/trunk/service/src/gen/thrift/gen-javabean/org/apache/hive/service/cli/thrift/TColumnValue.java
   where the exception is thrown

 don't know where to go, please help






Re: Spark SQL Join returns less rows that expected

2014-11-25 Thread Yin Huai
I guess you want to use split("\\|") instead of split("|").
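
The reason, spelled out: String.split takes a regular expression, and an unescaped | is the regex alternation operator, which matches the empty string at every position, so the line is chopped into individual characters instead of being split on the pipe delimiter. The corrected first map (same field indices as in the code below) would be:

  val stkrdd = rddFile.map(_.split("\\|"))
    .map(f => F1(f(44), f(3), f(10).toDouble, "", f(2)))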

On Tue, Nov 25, 2014 at 4:51 AM, Cheng Lian lian.cs@gmail.com wrote:

 Which version are you using? Or if you are using the most recent master or
 branch-1.2, which commit are you using?


 On 11/25/14 4:08 PM, david wrote:

 Hi,

   I have 2 files which come from csv import of 2 Oracle tables.

   F1 has 46730613 rows
   F2 has   3386740 rows

 I build 2 tables with spark.

 Table F1 join with table F2 on c1=d1.


 All keys F2.d1 exists in F1.c1,  so i expect to retrieve 46730613  rows.
 But
 it returns only 3437  rows

 // --- begin code ---

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD


 val rddFile = sc.textFile("hdfs://referential/F1/part-*")
 case class F1(c1:String, c2:String, c3:Double, c4:String, c5: String)
 val stkrdd = rddFile.map(x => x.split("|")).map(f =>
 F1(f(44),f(3),f(10).toDouble, "",f(2)))
 stkrdd.registerAsTable("F1")
 sqlContext.cacheTable("F1")


 val prdfile = sc.textFile("hdfs://referential/F2/part-*")
 case class F2(d1: String, d2:String, d3:String,d4:String)
 val productrdd = prdfile.map(x => x.split("|")).map(f =>
 F2(f(0),f(2),f(101),f(3)))
 productrdd.registerAsTable("F2")
 sqlContext.cacheTable("F2")

 val resrdd = sqlContext.sql("Select count(*) from F1, F2 where F1.c1 = F2.d1").count()

 // --- end of code ---


 Does anybody know what i missed ?

 Thanks






 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-SQL-Join-returns-less-rows-
 that-expected-tp19731.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to deal with BigInt in my case class for RDD = SchemaRDD convertion

2014-11-21 Thread Yin Huai
Hello Jianshi,

The reason for that error is that we do not have a Spark SQL data type for
Scala BigInt. You can use Decimal for your case.

Thanks,

Yin
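
A minimal sketch of that change (illustrative case class; the reflection-based conversion maps scala.math.BigDecimal to Spark SQL's decimal type):

  case class Account(id: String, balance: BigDecimal)  // was BigInt
  import sqlContext.createSchemaRDD
  sc.parallelize(Seq(Account("a", BigDecimal("12345678901234567890"))))
    .registerTempTable("accounts")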

On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi,

 I got an error during rdd.registerTempTable(...) saying scala.MatchError:
 scala.BigInt

 Looks like BigInt cannot be used in SchemaRDD, is that correct?

 So what would you recommend to deal with it?

 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: Converting a json struct to map

2014-11-19 Thread Yin Huai
Oh, actually, we do not support MapType provided by the schema given to
jsonRDD at the moment (my bad..). Daniel, you need to wait for the patch for
SPARK-4476 (I should have one soon).

Thanks,

Yin

On Wed, Nov 19, 2014 at 2:32 PM, Daniel Haviv danielru...@gmail.com wrote:

 Thank you Michael
 I will try it out tomorrow

 Daniel

 On 19 בנוב׳ 2014, at 21:07, Michael Armbrust mich...@databricks.com
 wrote:

 You can override the schema inference by passing a schema as the second
 argument to jsonRDD, however thats not a super elegant solution.  We are
 considering one option to make this easier here:
 https://issues.apache.org/jira/browse/SPARK-4476

 On Tue, Nov 18, 2014 at 11:06 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Something like this?

val map_rdd = json_rdd.map(json = {
   val mapper = new ObjectMapper() with ScalaObjectMapper
   mapper.registerModule(DefaultScalaModule)

   val myMap = mapper.readValue[Map[String,String]](json)

   myMap
 })

 Thanks
 Best Regards

 On Wed, Nov 19, 2014 at 11:01 AM, Daniel Haviv danielru...@gmail.com
 wrote:

 Hi,
 I'm loading a json file into a RDD and then save that RDD as parquet.
 One of the fields is a map of keys and values but it is being translated
 and stored as a struct.

 How can I convert the field into a map?


 Thanks,
 Daniel






Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian,

Right now, MapType is not supported in the StructType provided to
jsonRDD/jsonFile. We will add the support. I have created
https://issues.apache.org/jira/browse/SPARK-4302 to track this issue.

Thanks,

Yin

On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote:

 I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)).
 I'd like some of the json fields to be in a MapType rather than a sub
 StructType, as the keys will be very sparse.

 For example:
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD
  val jsonRdd = sc.parallelize(Seq({key: 1234, attributes:
  {gender: m}},
{key: 4321,
 attributes: {location: nyc}}))
  val schemaRdd = sqlContext.jsonRDD(jsonRdd)
  schemaRdd.printSchema
 root
  |-- attributes: struct (nullable = true)
  ||-- gender: string (nullable = true)
  ||-- location: string (nullable = true)
  |-- key: string (nullable = true)
  schemaRdd.collect
 res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
 [[null,nyc],4321])


 However this isn't what I want.  So I created my own StructType to pass to
 the jsonRDD call:

  import org.apache.spark.sql._
  val st = StructType(Seq(StructField(key, StringType, false),
StructField(attributes,
 MapType(StringType, StringType, false
  val jsonRddSt = sc.parallelize(Seq({key: 1234, attributes:
  {gender: m}},
   {key: 4321,
 attributes: {location: nyc}}))
  val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
  schemaRddSt.printSchema
 root
  |-- key: string (nullable = false)
  |-- attributes: map (nullable = true)
  ||-- key: string
  ||-- value: string (valueContainsNull = false)
  schemaRddSt.collect
 ***  Failure  ***
 scala.MatchError: MapType(StringType,StringType,false) (of class
 org.apache.spark.sql.catalyst.types.MapType)
 at
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
 ...

 The schema of the schemaRDD is correct.  But it seems that the json cannot
 be coerced to a MapType.  I can see at the line in the stack trace that
 there is no case statement for MapType.  Is there something I'm missing?
 Is
 this a bug or decision to not support MapType with json?

 Thanks,
 Brian




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: [SQL] PERCENTILE is not working

2014-11-05 Thread Yin Huai
Hello Kevin,

https://issues.apache.org/jira/browse/SPARK-3891 will fix this bug.

Thanks,

Yin

On Wed, Nov 5, 2014 at 8:06 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Which version are you using? I can reproduce that in the latest code, but
 with different exception.
 I've filed an bug https://issues.apache.org/jira/browse/SPARK-4263, can
 you also add some information there?

 Thanks,
 Cheng Hao

 -Original Message-
 From: Kevin Paul [mailto:kevinpaulap...@gmail.com]
 Sent: Thursday, November 6, 2014 7:09 AM
 To: user
 Subject: [SQL] PERCENTILE is not working

 Hi all, I encounter this error when execute the query

 sqlContext.sql(select percentile(age, array(0, 0.5, 1)) from
 people).collect()

 java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer
 cannot be cast to [Ljava.lang.Object;

 at
 org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)

 at
 org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)

 at
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)

 at
 org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)

 at org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)

 at
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:135)

 at
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)

 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)

 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)

 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 Thanks,
 Kevin Paul

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark sql create nested schema

2014-11-04 Thread Yin Huai
Hello Tridib,

For your case, you can use StructType(StructField("ParentInfo", parentInfo,
true) :: StructField("ChildInfo", childInfo, true) :: Nil) to create the
StructType representing the schema (parentInfo and childInfo are two
existing StructTypes). You can take a look at our docs (
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema
 and
http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package
).

Thanks,

Yin
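
A fuller sketch matching the layout printed in the post below (all leaf types assumed to be strings, as shown there):

  import org.apache.spark.sql._
  val parentInfo = StructType(Seq(
    StructField("ID", StringType, true),
    StructField("State", StringType, true),
    StructField("Zip", StringType, true)))
  val childInfo = StructType(Seq(
    StructField("ID", StringType, true),
    StructField("State", StringType, true),
    StructField("Hobby", StringType, true),
    StructField("Zip", StringType, true)))
  val root = StructType(Seq(
    StructField("ParentInfo", parentInfo, true),
    StructField("ChildInfo", childInfo, true)))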

On Tue, Nov 4, 2014 at 3:19 PM, tridib tridib.sama...@live.com wrote:

 I am trying to create a schema which will look like:

 root
  |-- ParentInfo: struct (nullable = true)
  ||-- ID: string (nullable = true)
  ||-- State: string (nullable = true)
  ||-- Zip: string (nullable = true)
  |-- ChildInfo: struct (nullable = true)
  ||-- ID: string (nullable = true)
  ||-- State: string (nullable = true)
  ||-- Hobby: string (nullable = true)
  ||-- Zip: string (nullable = true)

 How do I create a StructField of StructType? I think that's what the root
 is.

 Thanks  Regards
 Tridib



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-nested-schema-tp18090.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



